Dates fill in for date range and fillna - python

I am querying my database for records from the past week, then aggregating and transposing the data into a pandas DataFrame in Python.
In this table I am attempting to show what occurred on each day of the past 7 days; however, on some days no events occur, and in those cases the date is missing altogether. I am looking for a way to append the dates that are not present (but are part of the date range specified in the query) so that I can then fillna the other missing columns with any value I wish.
In some trials I have the data in a pandas DataFrame with the dates as the index, and in others the dates are a column. Ideally I would like the dates across the top - grouped by name, with purchase and send_back stacked, and the dates as the 'columns'.
Here is an example of how the dataframe looks now and what I am looking for:
The query covers the date range 01.08.2016 - 08.08.2016. The DataFrame currently looks like this:
        dates     name  purchase  send_back
0  01.08.2016  Michael       120          0
1  02.08.2016    Sarah       100         40
2  04.08.2016    Sarah        55          0
3  05.08.2016  Michael        80         20
4  07.08.2016    Sarah       130          0
After:
        dates     name  purchase  send_back
0  01.08.2016  Michael       120          0
1  02.08.2016    Sarah       100         40
2  03.08.2016        -         0          0
3  04.08.2016    Sarah        55          0
4  05.08.2016  Michael        80         20
5  06.08.2016        -         0          0
6  07.08.2016    Sarah       130          0
7  08.08.2016    Sarah         0         35
8  08.08.2016  Michael        20          0
Printing df.index gives:
Index([u'dates', u'name', u'purchase', u'send_back'], dtype='object')
RangeIndex(start=0, stop=1, step=1)
I appreciate any guidance.

Assuming you have the following DataFrame:
In [93]: df
Out[93]:
name purchase send_back
dates
2016-08-01 Michael 120 0
2016-08-02 Sarah 100 40
2016-08-04 Sarah 55 0
2016-08-05 Michael 80 20
2016-08-07 Sarah 130 0
You can resample daily, then replace the missing names and fill the numeric NaNs:
In [94]: df.resample('D').first().replace({'name':{np.nan:'-'}}).fillna(0)
Out[94]:
name purchase send_back
dates
2016-08-01 Michael 120.0 0.0
2016-08-02 Sarah 100.0 40.0
2016-08-03 - 0.0 0.0
2016-08-04 Sarah 55.0 0.0
2016-08-05 Michael 80.0 20.0
2016-08-06 - 0.0 0.0
2016-08-07 Sarah 130.0 0.0
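
Note that resample('D') needs a DatetimeIndex. A minimal setup sketch (assumed, not part of the answer above) that builds such an index from the question's day-first date strings:
import numpy as np
import pandas as pd

# Rebuild the question's sample data (assumed) and parse the day-first strings
df = pd.DataFrame({'dates': ['01.08.2016', '02.08.2016', '04.08.2016',
                             '05.08.2016', '07.08.2016'],
                   'name': ['Michael', 'Sarah', 'Sarah', 'Michael', 'Sarah'],
                   'purchase': [120, 100, 55, 80, 130],
                   'send_back': [0, 40, 0, 20, 0]})
df['dates'] = pd.to_datetime(df['dates'], format='%d.%m.%Y')
df = df.set_index('dates')

# The one-liner above then fills the missing days (same result as Out[94])
filled = df.resample('D').first().replace({'name': {np.nan: '-'}}).fillna(0)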

Your 'dates' column is of object type, so you must convert it to datetime format first.
from datetime import datetime

# Converting the object dates to datetime
df['dates'] = df['dates'].apply(lambda x: datetime.strptime(x, "%d.%m.%Y"))
# Setting the index column
df.set_index(['dates'], inplace=True)
# Choosing a date range extending from first date to the last date with a set frequency
new_index = pd.date_range(start=df.index[0], end=df.index[-1], freq='D')
new_index.name = df.index.name
# Setting the new index
df = df.reindex(new_index)
# Making the required modifications
df.iloc[:, 0] = df.iloc[:, 0].fillna('-')
df.iloc[:, 1:] = df.iloc[:, 1:].fillna(0)
print (df)
name purchase send_back
dates
2016-08-01 Michael 120.0 0.0
2016-08-02 Sarah 100.0 40.0
2016-08-03 - 0.0 0.0
2016-08-04 Sarah 55.0 0.0
2016-08-05 Michael 80.0 20.0
2016-08-06 - 0.0 0.0
2016-08-07 Sarah 130.0 0.0
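As a side note, the two positional fills above can be collapsed into a single call with a per-column dict (a sketch, assuming exactly these three columns):
# Equivalent per-column fill in one call (assumed column names)
df = df.fillna({'name': '-', 'purchase': 0, 'send_back': 0})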
Let's suppose you have data for a single day (as mentioned in the comments section) and you would like to fill the other days of the week with null values:
Data Setup:
df = pd.DataFrame({'dates': ['01.08.2016'], 'name': ['Michael'],
                   'purchase': [120], 'send_back': [0]})
print (df)
dates name purchase send_back
0 01.08.2016 Michael 120 0
Operations:
df['dates'] = df['dates'].apply(lambda x: datetime.strptime(x, "%d.%m.%Y"))
df.set_index(['dates'], inplace=True)
# Setting periods as 7 to account for the end of the week
new_index = pd.date_range(start=df.index[0], periods=7, freq='D')
new_index.name = df.index.name
# Setting the new index
df = df.reindex(new_index)
print (df)
name purchase send_back
dates
2016-08-01 Michael 120.0 0.0
2016-08-02 NaN NaN NaN
2016-08-03 NaN NaN NaN
2016-08-04 NaN NaN NaN
2016-08-05 NaN NaN NaN
2016-08-06 NaN NaN NaN
2016-08-07 NaN NaN NaN
In case you want to fill the null values with 0's, you could do:
df.fillna(0, inplace=True)
print (df)
name purchase send_back
dates
2016-08-01 Michael 120.0 0.0
2016-08-02 0 0.0 0.0
2016-08-03 0 0.0 0.0
2016-08-04 0 0.0 0.0
2016-08-05 0 0.0 0.0
2016-08-06 0 0.0 0.0
2016-08-07 0 0.0 0.0
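Coming back to the original wish of having the dates across the top, one possible follow-up (a sketch, assuming df is the gap-filled frame from the first part of this answer, indexed by 'dates') is to stack purchase/send_back and unstack the dates:
# Group by name, stack the two measures, and move the dates into the columns
wide = (df.reset_index()
          .set_index(['name', 'dates'])[['purchase', 'send_back']]
          .stack()              # long form: (name, date, measure) -> value
          .unstack('dates')     # dates across the top
          .fillna(0))
print(wide)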

Related

Group by month from a particular date python

I have a DataFrame of an account statement that contains date, debit and credit columns.
Let's just say salary gets deposited on the 20th of every month.
I want to group the date column from the 20th of each month to the 20th of the next to find the sum of debits and credits, e.g. 20th Jan to 20th Feb, and so on.
date_parsed Debit Credit
0 2020-05-02 775.0 0.0
1 2020-04-30 209.0 0.0
2 2020-04-24 5000.0 0.0
3 2020-04-24 25000.0 0.0
... ... ... ...
79 2020-04-20 750.0 0.0
80 2020-04-15 5000.0 0.0
81 2020-04-13 0.0 2283.0
82 2020-04-09 0.0 6468.0
83 2020-04-03 0.0 1000.0
I am not sure, but perhaps pd.offsets can be used with groupby.
You could add an extra month column which rounds up or down based on the day of the month. Then it's just a groupby and sum. E.g. month 2020-06 would include dates between 2020-05-20 and 2020-06-19.
import pandas as pd
import numpy as np
df = pd.DataFrame({'date_parsed': ['2020-05-02', '2020-05-03', '2020-05-20', '2020-05-22'], 'Credit': [1,2,3,4], 'Debit': [5,6,7,8]})
df['date'] = pd.to_datetime(df.date_parsed)
df['month'] = np.where(df.date.dt.day < 20, df.date.dt.to_period('M'), (df.date + pd.DateOffset(months=1)).dt.to_period('M'))
print(df[['month', 'Credit', 'Debit']].groupby('month').sum().reset_index())
Input:
date_parsed Credit Debit
0 2020-05-02 1 5
1 2020-05-03 2 6
2 2020-05-20 3 7
3 2020-05-22 4 8
Result:
month Credit Debit
0 2020-05 3 11
1 2020-06 7 15

How to transform many_days to daily in python?

I have the following dataframe:
import pandas as pd
dt = pd.DataFrame({'start_date': ['2019-05-20', '2019-05-21', '2019-05-21'],
                   'end_date': ['2019-05-23', '2019-05-24', '2019-05-22'],
                   'reg': ['A', 'B', 'A'],
                   'measure': [100, 200, 1000]})
I would like to create a new column, called 'date', which will have values from start_date until end_date, and also a new column measure_daily which will be the measure spread equally among those dates.
So basically, I would like to expand dt in terms of rows.
So I would like the final df to look like:
dt_f = pd.DataFrame({'date': ['2019-05-20', '2019-05-21', '2019-05-22', '2019-05-23',
                              '2019-05-21', '2019-05-22', '2019-05-23', '2019-05-24',
                              '2019-05-21', '2019-05-22'],
                     'reg': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'A', 'A'],
                     'measure_daily': [25, 25, 25, 25, 50, 50, 50, 50, 500, 500]})
Is there an efficient way to do this in Python?
TL;DR
just give me the solution:
dt = dt.assign(key=dt.index)
melt = dt.melt(id_vars=['reg', 'measure', 'key'], value_name='date').drop('variable', axis=1)
melt['date'] = pd.to_datetime(melt['date'])   # dates must be datetime for resample to work
melt = pd.concat(
    [d.set_index('date').resample('d').first().ffill()
     for _, d in melt.groupby(['reg', 'key'], sort=False)]
).reset_index()
melt.assign(measure=melt['measure'].div(
    melt.groupby(['reg', 'key'], sort=False)['reg'].transform('size'))).drop('key', axis=1)
Breakdown:
First we melt your start and end date into the same column and make sure it is datetime:
dt = dt.assign(key=dt.index)
melt = dt.melt(id_vars=['reg', 'measure', 'key'], value_name='date').drop('variable', axis=1)
melt['date'] = pd.to_datetime(melt['date'])
reg measure key date
0 A 100 0 2019-05-20
1 B 200 1 2019-05-21
2 A 1000 2 2019-05-21
3 A 100 0 2019-05-23
4 B 200 1 2019-05-24
5 A 1000 2 2019-05-22
Then we resample on a daily basis, grouping by reg and key so each original row is filled within its own group.
melt = pd.concat(
    [d.set_index('date').resample('d').first().ffill()
     for _, d in melt.groupby(['reg', 'key'], sort=False)]
).reset_index()
date reg measure key
0 2019-05-20 A 100.0 0.0
1 2019-05-21 A 100.0 0.0
2 2019-05-22 A 100.0 0.0
3 2019-05-23 A 100.0 0.0
4 2019-05-21 B 200.0 1.0
5 2019-05-22 B 200.0 1.0
6 2019-05-23 B 200.0 1.0
7 2019-05-24 B 200.0 1.0
8 2019-05-21 A 1000.0 2.0
9 2019-05-22 A 1000.0 2.0
Finally we spread out the measure column over the size of each group with assign:
melt.assign(measure = melt['measure'].div(melt.groupby(['reg', 'key'], sort=False)['reg'].transform('size'))).drop('key', axis=1)
date reg measure
0 2019-05-20 A 25.0
1 2019-05-21 A 25.0
2 2019-05-22 A 25.0
3 2019-05-23 A 25.0
4 2019-05-21 B 50.0
5 2019-05-22 B 50.0
6 2019-05-23 B 50.0
7 2019-05-24 B 50.0
8 2019-05-21 A 500.0
9 2019-05-22 A 500.0
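
On newer pandas (0.25+ for explode), an alternative sketch built from the question's dt; treat it as an assumption-laden variant rather than part of the answer above:
import pandas as pd

dt = pd.DataFrame({'start_date': ['2019-05-20', '2019-05-21', '2019-05-21'],
                   'end_date': ['2019-05-23', '2019-05-24', '2019-05-22'],
                   'reg': ['A', 'B', 'A'],
                   'measure': [100, 200, 1000]})

# One list of daily dates per row, then divide the measure by the number of days spanned
dt['date'] = [list(pd.date_range(s, e, freq='D'))
              for s, e in zip(dt['start_date'], dt['end_date'])]
dt['measure_daily'] = dt['measure'] / dt['date'].apply(len)

# explode turns each list of dates into one row per date
out = dt.explode('date')[['date', 'reg', 'measure_daily']].reset_index(drop=True)
print(out)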

Counting Number of Occurrences Between Dates (Given an ID value) From Another Dataframe

Pandas: select DF rows based on another DF is the closest answer I can find to my question, but I don't believe it quite solves it.
Anyway, I am working with two very large pandas dataframes (so speed is a consideration), df_emails and df_trips, both of which are already sorted by CustID and then by date.
df_emails includes the date we sent a customer an email and it looks like this:
CustID DateSent
0 2 2018-01-20
1 2 2018-02-19
2 2 2018-03-31
3 4 2018-01-10
4 4 2018-02-26
5 5 2018-02-01
6 5 2018-02-07
df_trips includes the dates a customer came to the store and how much they spent, and it looks like this:
CustID TripDate TotalSpend
0 2 2018-02-04 25
1 2 2018-02-16 100
2 2 2018-02-22 250
3 4 2018-01-03 50
4 4 2018-02-28 100
5 4 2018-03-21 100
6 8 2018-01-07 200
Basically, what I need to do is find the number of trips and total spend for each customer in between each email sent. If it is the last time an email is sent for a given customer, I need to find the total number of trips and total spend after the email, but before the end of the data (2018-04-01). So the final dataframe would look like this:
CustID DateSent NextDateSentOrEndOfData TripsBetween TotalSpendBetween
0 2 2018-01-20 2018-02-19 2.0 125.0
1 2 2018-02-19 2018-03-31 1.0 250.0
2 2 2018-03-31 2018-04-01 0.0 0.0
3 4 2018-01-10 2018-02-26 0.0 0.0
4 4 2018-02-26 2018-04-01 2.0 200.0
5 5 2018-02-01 2018-02-07 0.0 0.0
6 5 2018-02-07 2018-04-01 0.0 0.0
Though I have tried my best to do this in a Python/Pandas friendly way, the only accurate solution I have been able to implement is through an np.where, shifting, and looping. The solution looks like this:
df_emails["CustNthVisit"] = df_emails.groupby("CustID").cumcount()+1
df_emails["CustTotalVisit"] = df_emails.groupby("CustID")["CustID"].transform('count')
df_emails["NextDateSentOrEndOfData"] = pd.to_datetime(df_emails["DateSent"].shift(-1)).where(df_emails["CustNthVisit"] != df_emails["CustTotalVisit"], pd.to_datetime('04-01-2018'))
for i in df_emails.index:
    df_emails.at[i, "TripsBetween"] = len(df_trips[(df_trips["CustID"] == df_emails.at[i, "CustID"]) &
                                                   (df_trips["TripDate"] > df_emails.at[i, "DateSent"]) &
                                                   (df_trips["TripDate"] < df_emails.at[i, "NextDateSentOrEndOfData"])])
for i in df_emails.index:
    df_emails.at[i, "TotalSpendBetween"] = df_trips[(df_trips["CustID"] == df_emails.at[i, "CustID"]) &
                                                    (df_trips["TripDate"] > df_emails.at[i, "DateSent"]) &
                                                    (df_trips["TripDate"] < df_emails.at[i, "NextDateSentOrEndOfData"])].TotalSpend.sum()
df_emails.drop(['CustNthVisit',"CustTotalVisit"], axis=1, inplace=True)
However, a %%timeit has revealed that this takes 10.6ms on just the seven rows shown above, which makes this solution pretty much infeasible on my actual datasets of about 1,000,000 rows. Does anyone know a solution here that is faster and thus feasible?
Add the next date column to emails
df_emails["NextDateSent"] = df_emails.groupby("CustID").shift(-1)
Sort for merge_asof and then merge to nearest to create a trip lookup table
df_emails = df_emails.sort_values("DateSent")
df_trips = df_trips.sort_values("TripDate")
df_lookup = pd.merge_asof(df_trips, df_emails, by="CustID", left_on="TripDate",right_on="DateSent", direction="backward")
Aggregate the lookup table for the data you want.
df_lookup = df_lookup.loc[:, ["CustID", "DateSent", "TotalSpend"]].groupby(["CustID", "DateSent"]).agg(["count","sum"])
Left join it back to the email table.
df_merge = df_emails.join(df_lookup, on=["CustID", "DateSent"]).sort_values("CustID")
I chose to leave the NaNs as NaNs because I don't like filling in default values (you can always do that later if you prefer, but you can't easily distinguish between things that existed and things that didn't if you put defaults in early).
CustID DateSent NextDateSent (TotalSpend, count) (TotalSpend, sum)
0 2 2018-01-20 2018-02-19 2.0 125.0
1 2 2018-02-19 2018-03-31 1.0 250.0
2 2 2018-03-31 NaT NaN NaN
3 4 2018-01-10 2018-02-26 NaN NaN
4 4 2018-02-26 NaT 2.0 200.0
5 5 2018-02-01 2018-02-07 NaN NaN
6 5 2018-02-07 NaT NaN NaN
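For completeness, a setup sketch (assumed, rebuilt from the question's tables) that makes the steps above runnable; merge_asof needs real datetime keys, and the answer sorts them explicitly:
import pandas as pd

# Assumed reconstruction of the question's frames
df_emails = pd.DataFrame({'CustID': [2, 2, 2, 4, 4, 5, 5],
                          'DateSent': ['2018-01-20', '2018-02-19', '2018-03-31',
                                       '2018-01-10', '2018-02-26', '2018-02-01', '2018-02-07']})
df_trips = pd.DataFrame({'CustID': [2, 2, 2, 4, 4, 4, 8],
                         'TripDate': ['2018-02-04', '2018-02-16', '2018-02-22',
                                      '2018-01-03', '2018-02-28', '2018-03-21', '2018-01-07'],
                         'TotalSpend': [25, 100, 250, 50, 100, 100, 200]})

# merge_asof requires datetime (or numeric) keys, so parse the date strings first
df_emails['DateSent'] = pd.to_datetime(df_emails['DateSent'])
df_trips['TripDate'] = pd.to_datetime(df_trips['TripDate'])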
This would be an easy case of merge_asof had I been able to handle the max_date, so I go a long way:
max_date = pd.to_datetime('2018-04-01')
# set_index for easy extraction by id
df_emails.set_index('CustID', inplace=True)
# we want this later in the final output
df_emails['NextDateSentOrEndOfData'] = df_emails.groupby('CustID')['DateSent'].shift(-1).fillna(max_date)
# cuts function for groupby
def cuts(df):
    custID = df.CustID.iloc[0]
    bins = list(df_emails.loc[[custID], 'DateSent']) + [max_date]
    return pd.cut(df.TripDate, bins=bins, right=False)
# bin the dates:
s = df_trips.groupby('CustID', as_index=False, group_keys=False).apply(cuts)
# aggregate the info:
new_df = (df_trips.groupby([df_trips.CustID, s])
                  .TotalSpend.agg(['sum', 'size'])
                  .reset_index()
          )
# get the right limit:
new_df['NextDateSentOrEndOfData'] = new_df.TripDate.apply(lambda x: x.right)
# drop the unnecessary info
new_df.drop('TripDate', axis=1, inplace=True)
# merge:
df_emails.reset_index().merge(new_df,
                              on=['CustID', 'NextDateSentOrEndOfData'],
                              how='left'
                              )
Output:
CustID DateSent NextDateSentOrEndOfData sum size
0 2 2018-01-20 2018-02-19 125.0 2.0
1 2 2018-02-19 2018-03-31 250.0 1.0
2 2 2018-03-31 2018-04-01 NaN NaN
3 4 2018-01-10 2018-02-26 NaN NaN
4 4 2018-02-26 2018-04-01 200.0 2.0
5 5 2018-02-01 2018-02-07 NaN NaN
6 5 2018-02-07 2018-04-01 NaN NaN
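To match the desired output exactly, that final merge could be extended with a rename and a zero-fill; a sketch (output column names taken from the question):
# Assumed follow-up: rename the aggregates and fill the gaps with zeros
result = (df_emails.reset_index()
          .merge(new_df, on=['CustID', 'NextDateSentOrEndOfData'], how='left')
          .rename(columns={'size': 'TripsBetween', 'sum': 'TotalSpendBetween'})
          .fillna({'TripsBetween': 0, 'TotalSpendBetween': 0}))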

Pandas fill in missing date within each group with information in the previous row

Similar question to this one, but with some modifications:
Instead of filling in missing dates for each group between the min and max date of the entire column, we should only fill in the dates between the min and the max of that group, and output a dataframe that keeps only the last row for each date in each group.
Reproducible example:
x = pd.DataFrame({'dt': ['2016-01-01', '2016-01-03', '2016-01-04',
                         '2016-01-01', '2016-01-01', '2016-01-04'],
                  'amount': [10.0, 30.0, 40.0, 78.0, 80.0, 82.0],
                  'sub_id': [1, 1, 1, 2, 2, 2]})
Visually:
dt sub_id amount
0 2016-01-01 1 10.0
1 2016-01-03 1 30.0
2 2016-01-04 1 40.0
3 2016-01-01 2 78.0
4 2016-01-01 2 80.0
5 2016-01-04 2 82.0
Output I need:
dt sub_id amount
0 2016-01-01 1 10.0
1 2016-01-02 1 10.0
2 2016-01-03 1 30.0
3 2016-01-04 1 40.0
4 2016-01-01 2 80.0
5 2016-01-02 2 80.0
6 2016-01-03 2 80.0
7 2016-01-04 2 82.0
We are grouping by dt and sub_id. As you can see, in sub_id=1 a row was added for 2016-01-02 and amount was imputed at 10.0, as the previous row was 10.0 (assume the data is sorted beforehand to enable this). For sub_id=2, rows were added for 2016-01-02 and 2016-01-03 and the amount is 80.0, as that was the last row before those dates. The first row for 2016-01-01 was also deleted, because we just want to keep the last row for each date and sub_id.
Looking for the most efficient way to do this as the real data has millions of rows. I have a current method using lambda functions and applying them across groups of sub_id but I feel like we could do better.
Thanks!
Getting the date right of course:
x.dt = pd.to_datetime(x.dt)
Then this:
cols = ['dt', 'sub_id']
pd.concat([
    d.asfreq('D').ffill(downcast='infer')
    for _, d in x.drop_duplicates(cols, keep='last')
                 .set_index('dt')
                 .groupby('sub_id')
]).reset_index()
dt amount sub_id
0 2016-01-01 10 1
1 2016-01-02 10 1
2 2016-01-03 30 1
3 2016-01-04 40 1
4 2016-01-01 80 2
5 2016-01-02 80 2
6 2016-01-03 80 2
7 2016-01-04 82 2
By using resample with groupby
x.dt=pd.to_datetime(x.dt)
x.set_index('dt').groupby('sub_id').apply(lambda x : x.resample('D').max().ffill()).reset_index(level=1)
Out[265]:
dt amount sub_id
sub_id
1 2016-01-01 10.0 1.0
1 2016-01-02 10.0 1.0
1 2016-01-03 30.0 1.0
1 2016-01-04 40.0 1.0
2 2016-01-01 80.0 2.0
2 2016-01-02 80.0 2.0
2 2016-01-03 80.0 2.0
2 2016-01-04 82.0 2.0
use asfreq & groupby
first convert dt to datetime & get rid of duplicates
then for each group of sub_id use asfreq('D', method='ffill') to generate missing dates and impute amounts
finally reset_index on amount column as there's a duplicate sub_id column as well as index.
x.dt = pd.to_datetime(x.dt)
(x.drop_duplicates(['dt', 'sub_id'], keep='last')
  .groupby('sub_id')
  .apply(lambda g: g.set_index('dt').asfreq('D', method='ffill'))
  .amount
  .reset_index())
# output:
sub_id dt amount
0 1 2016-01-01 10.0
1 1 2016-01-02 10.0
2 1 2016-01-03 30.0
3 1 2016-01-04 40.0
4 2 2016-01-01 80.0
5 2 2016-01-02 80.0
6 2 2016-01-03 80.0
7 2 2016-01-04 82.0
The below works for me and seems pretty efficient, but I can't say whether it is efficient enough. It does avoid lambdas, though.
I called your data df.
Create a base_df with the entire date / sub_id grid:
import pandas as pd
from itertools import product
base_grid = product(pd.date_range(df['dt'].min(), df['dt'].max(), freq='D'),
                    list(range(df['sub_id'].min(), df['sub_id'].max() + 1, 1)))
base_df = pd.DataFrame(list(base_grid), columns=['dt', 'sub_id'])
Get the max value per dt / sub_id from df:
max_value_df = df.loc[df.groupby(['dt', 'sub_id'])['amount'].idxmax()]
max_value_df['dt'] = max_value_df['dt'].apply(pd.Timestamp)
Merge base_df on the max values:
merged_df = base_df.merge(max_value_df, how='left', on=['dt', 'sub_id'])
Sort and forward fill the maximal value:
merged_df = merged_df.sort_values(by=['sub_id', 'dt', 'amount'], ascending=True)
merged_df['amount'] = merged_df.groupby(['sub_id'])['amount'].fillna(method='ffill')
Result:
dt sub_id amount
0 2016-01-01 1 10.0
2 2016-01-02 1 10.0
4 2016-01-03 1 30.0
6 2016-01-04 1 40.0
1 2016-01-01 2 80.0
3 2016-01-02 2 80.0
5 2016-01-03 2 80.0
7 2016-01-04 2 82.0

Fill missing dates

I have a dataframe that contains temperature readings from different areas on different dates.
I want to add the missing dates for each area with a zero temperature.
For example:
df=pd.DataFrame({"area_id":[1,1,1,2,2,2,3,3,3],
"reading_date":["13/1/2017","15/1/2017"
,"16/1/2017","22/3/2017","26/3/2017"
,"28/3/2017","15/5/2017"
,"16/5/2017","18/5/2017"],
"temp":[12,15,22,6,14,8,30,25,33]})
What is the most efficient way to fill the date gaps per area (with zeros), as shown below?
Many thanks.
Use:
first convert the reading_date column to datetime with to_datetime
then set_index to get a DatetimeIndex and use groupby with resample
on the resulting Series, add asfreq
replace the NaNs with fillna
last, add reset_index to turn the MultiIndex back into columns
df['reading_date'] = pd.to_datetime(df['reading_date'])
df = (df.set_index('reading_date')
        .groupby('area_id')
        .resample('d')['temp']
        .asfreq()
        .fillna(0)
        .reset_index())
print (df)
area_id reading_date temp
0 1 2017-01-13 12.0
1 1 2017-01-14 0.0
2 1 2017-01-15 15.0
3 1 2017-01-16 22.0
4 2 2017-03-22 6.0
5 2 2017-03-23 0.0
6 2 2017-03-24 0.0
7 2 2017-03-25 0.0
8 2 2017-03-26 14.0
9 2 2017-03-27 0.0
10 2 2017-03-28 8.0
11 3 2017-05-15 30.0
12 3 2017-05-16 25.0
13 3 2017-05-17 0.0
14 3 2017-05-18 33.0
Using reindex. Define a custom function to handle the reindexing operation, and call it inside groupby.apply.
def reindex(x):
    # Thanks to @jezrael for the improvement.
    return x.reindex(pd.date_range(x.index.min(), x.index.max()), fill_value=0)
Next, convert reading_date to datetime using pd.to_datetime:
df.reading_date = pd.to_datetime(df.reading_date)
Now, perform a groupby.
df = (
    df.set_index('reading_date')
      .groupby('area_id')
      .temp
      .apply(reindex)
      .reset_index()
)
df.columns = ['area_id', 'reading_date', 'temp']
df
area_id reading_date temp
0 1 2017-01-13 12.0
1 1 2017-01-14 0.0
2 1 2017-01-15 15.0
3 1 2017-01-16 22.0
4 2 2017-03-22 6.0
5 2 2017-03-23 0.0
6 2 2017-03-24 0.0
7 2 2017-03-25 0.0
8 2 2017-03-26 14.0
9 2 2017-03-27 0.0
10 2 2017-03-28 8.0
11 3 2017-05-15 30.0
12 3 2017-05-16 25.0
13 3 2017-05-17 0.0
14 3 2017-05-18 33.0
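
If more columns than temp needed filling, a hedged variant of the same reindex idea (assumed, not part of the answer above; it starts again from the question's df right after the pd.to_datetime step) could reindex each whole group:
# Reindex every column of each group, fill the gaps with 0, and restore area_id
out = (df.set_index('reading_date')
         .groupby('area_id', group_keys=False)
         .apply(lambda g: g.reindex(pd.date_range(g.index.min(), g.index.max()),
                                    fill_value=0)
                           .assign(area_id=g['area_id'].iloc[0]))
         .rename_axis('reading_date')
         .reset_index())
print(out)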
