Conditional groupby on dates using pandas - python

I'm in a bit of a pickle. I've been working on a problem all day without seeing any real results. I'm working in Python and using Pandas for handling data.
What I'm trying to achieve is to sum each type of interaction based on the customer's previous interactions. The timestamp of the interaction should be less than the timestamp of the survey. Ideally, I would like to sum the interactions for the customer during some period before the survey - e.g. the last 5 years.
The first dataframe contains a customer ID, a segmentation of that customer in that survey (e.g. 1 being "happy", 2 being "sad") and a timestamp for the time of that survey.
import pandas as pd
#Generic example
customers = pd.DataFrame(
    {"customerID": [1, 1, 1, 2, 2, 3, 4, 4],
     "customerSeg": [1, 2, 2, 1, 2, 3, 3, 3],
     "timestamp": ['1999-01-01', '2000-01-01', '2000-06-01', '2001-01-01',
                   '2003-01-01', '1999-01-01', '2005-01-01', '2008-01-01']})
customers
Which yields something like:
customerID  customerSeg  timestamp
1           1            1999-01-01
1           1            2000-01-01
1           1            2000-06-01
2           2            2001-01-01
2           2            2003-01-01
3           3            1999-01-01
4           4            2005-01-01
4           4            2008-01-01
The other dataframe contains interactions with that customer, e.g. a service visit or a phone call.
interactions = pd.DataFrame(
    {"customerID": [1, 1, 1, 1, 2, 2, 2, 2, 4, 4, 4],
     "timestamp": ['1999-07-01', '1999-11-01', '2000-03-01', '2001-04-01',
                   '2000-12-01', '2002-01-01', '2004-03-01', '2004-05-01',
                   '2000-01-01', '2004-01-01', '2009-01-01'],
     "service": [1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1],
     "phonecall": [0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1]})
interactions
Output:
customerID  timestamp   service  phonecall
1           1999-07-01  1        0
1           1999-11-01  0        1
1           2000-03-01  1        1
1           2001-04-01  0        1
2           2000-12-01  1        1
2           2002-01-01  0        1
2           2004-03-01  1        0
2           2004-05-01  1        1
4           2000-01-01  0        1
4           2004-01-01  1        0
4           2009-01-01  1        1
Result for all previous interactions (ideally, I would like only the last 5 years):
customerID  customerSeg  timestamp   service  phonecall
1           1            1999-01-01  0        0
1           1            2000-01-01  1        1
1           1            2000-06-01  2        2
2           2            2001-01-01  1        1
2           2            2003-01-01  1        2
3           3            1999-01-01  0        0
4           4            2005-01-01  1        1
4           4            2008-01-01  1        1
I've tried almost everything I could come up with, so I would really appreciate some input. I'm pretty much confined to using pandas and Python, since it's the language I'm most familiar with, but also because I need to read a CSV file of the customer segmentation.

I think you need several steps for transforming your data.
First of all, we convert the timestamp columns in both dataframes to datetime, so we can calculate the desired interval and do the comparisons:
customers['timestamp'] = pd.to_datetime(customers['timestamp'])
interactions['timestamp'] = pd.to_datetime(interactions['timestamp'])
After that, we create a new column that contains the start date (e.g. 5 years before the timestamp):
customers['start_date'] = customers['timestamp'] - pd.DateOffset(years=5)
Now we join the customers dataframe with the interactions dataframe on the customerID:
result = customers.merge(interactions, on='customerID', how='outer')
This yields
customerID customerSeg timestamp_x start_date timestamp_y service phonecall
0 1 1 1999-01-01 1994-01-01 1999-07-01 1.0 0.0
1 1 1 1999-01-01 1994-01-01 1999-11-01 0.0 1.0
2 1 1 1999-01-01 1994-01-01 2000-03-01 1.0 1.0
3 1 1 1999-01-01 1994-01-01 2001-04-01 0.0 1.0
4 1 2 2000-01-01 1995-01-01 1999-07-01 1.0 0.0
5 1 2 2000-01-01 1995-01-01 1999-11-01 0.0 1.0
6 1 2 2000-01-01 1995-01-01 2000-03-01 1.0 1.0
7 1 2 2000-01-01 1995-01-01 2001-04-01 0.0 1.0
...
Now the condition is evaluated: we only want to count the service and phonecall interactions in rows that meet the condition (timestamp_y lies in the interval between start_date and timestamp_x), so we replace the others with zero:
result['service'] = result.apply(
    lambda x: x.service
    if (x.timestamp_y >= x.start_date) and (x.timestamp_y <= x.timestamp_x)
    else 0, axis=1)
result['phonecall'] = result.apply(
    lambda x: x.phonecall
    if (x.timestamp_y >= x.start_date) and (x.timestamp_y <= x.timestamp_x)
    else 0, axis=1)
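On large frames, the two row-wise apply calls can be replaced by a single vectorized boolean mask, which does the same zeroing much faster. A minimal sketch, rebuilding a small version of the merged frame from the steps above (two customers only, so the data is abbreviated, not the full example):

```python
import pandas as pd

# Abbreviated reconstruction of the merged frame from the steps above
customers = pd.DataFrame({"customerID": [1, 2],
                          "timestamp": pd.to_datetime(["2000-01-01", "2003-01-01"])})
customers["start_date"] = customers["timestamp"] - pd.DateOffset(years=5)
interactions = pd.DataFrame({"customerID": [1, 1, 2],
                             "timestamp": pd.to_datetime(["1999-07-01", "2001-04-01",
                                                          "2002-01-01"]),
                             "service": [1, 0, 0],
                             "phonecall": [0, 1, 1]})
result = customers.merge(interactions, on="customerID", how="outer")

# One boolean mask instead of two apply() calls: keep rows whose interaction
# timestamp falls inside [start_date, timestamp_x], zero out all other rows
in_window = result["timestamp_y"].between(result["start_date"], result["timestamp_x"])
result.loc[~in_window, ["service", "phonecall"]] = 0
```

The groupby-sum step afterwards stays exactly the same.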
Finally we group the dataframe, summing up the service and phonecall interactions:
result = result.groupby(['customerID', 'timestamp_x', 'customerSeg'])[['service', 'phonecall']].sum()
Result:
service phonecall
customerID timestamp_x customerSeg
1 1999-01-01 1 0.0 0.0
2000-01-01 2 1.0 1.0
2000-06-01 2 2.0 2.0
2 2001-01-01 1 1.0 1.0
2003-01-01 2 1.0 2.0
3 1999-01-01 3 0.0 0.0
4 2005-01-01 3 1.0 1.0
2008-01-01 3 1.0 0.0
(Note that your customerSeg data in the sample code seems not quite to match the data in the table.)

One option is to use the conditional_join from pyjanitor to compute the rows that match the criteria, before grouping and summing:
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor
customers['timestamp'] = pd.to_datetime(customers['timestamp'])
interactions['timestamp'] = pd.to_datetime(interactions['timestamp'])
customers['start_date'] = customers['timestamp'] - pd.DateOffset(years=5)
(interactions
 .conditional_join(
     customers,
     # column from left, column from right, comparison operator
     ('timestamp', 'timestamp', '<='),
     ('timestamp', 'start_date', '>='),
     ('customerID', 'customerID', '=='),
     how='right')
 # drop irrelevant columns
 .drop(columns=[('left', 'customerID'),
                ('left', 'timestamp'),
                ('right', 'start_date')])
 # return to single index
 .droplevel(0, 1)
 .groupby(['customerID', 'customerSeg', 'timestamp'])
 .sum()
)
service phonecall
customerID customerSeg timestamp
1 1 1999-01-01 0.0 0.0
2 2000-01-01 1.0 1.0
2000-06-01 2.0 2.0
2 1 2001-01-01 1.0 1.0
2 2003-01-01 1.0 2.0
3 3 1999-01-01 0.0 0.0
4 3 2005-01-01 1.0 1.0
2008-01-01 1.0 0.0

Related

Subtract one column from another in pandas - with a condition

I have this code that will subtract, for each person (AAC or AAB), time point 1 data from time point 2 data.
i.e this is the original data:
pep_seq AAC-T01 AAC-T02 AAB-T01 AAB-T02
0 0 1 2.0 NaN 4.0
1 4 3 2.0 6.0 NaN
2 4 3 NaN 6.0 NaN
3 4 5 2.0 6.0 NaN
This is the code:
import sys
import numpy as np
from sklearn.metrics import auc
import pandas as pd
from numpy import trapz
#read in file
df = pd.DataFrame([[0, 1, 2, np.nan, 4],
                   [4, 3, 2, 6, np.nan],
                   [4, 3, np.nan, 6, np.nan],
                   [4, 5, 2, 6, np.nan]],
                  columns=['pep_seq', 'AAC-T01', 'AAC-T02', 'AAB-T01', 'AAB-T02'])
#standardise the data by taking T0 away from each sample
df2 = df.drop(['pep_seq'],axis=1)
df2 = df2.apply(lambda x: x.sub(df2[x.name[:4]+"T01"]))
df2.insert(0,'pep_seq',df['pep_seq'])
print(df)
print(df2)
This is the output (i.e. df2)
pep_seq AAC-T01 AAC-T02 AAB-T01 AAB-T02
0 0 0 1.0 NaN NaN
1 4 0 -1.0 0.0 NaN
2 4 0 NaN 0.0 NaN
3 4 0 -3.0 0.0 NaN
...but what I actually wanted was to subtract the T01 columns from all the others EXCEPT for when the T01 value is NaN in which case keep the original value, so the desired output was (see the 4.0 in AAB-T02):
pep_seq AAC-T01 AAC-T02 AAB-T01 AAB-T02
0 0 0 1.0 NaN 4.0
1 4 0 -1.0 0 NaN
2 4 0 NaN 0 NaN
3 4 0 -3.0 0 NaN
Could someone show me where I went wrong? Note that in real life, there are ~100 timepoints per person, not just two.
You can fill the NaN with 0 when doing the subtraction:
df2 = df2.apply(lambda x: x.sub(df2[x.name[:4]+"T01"].fillna(0)))
# ^^^^ Changes here
df2.insert(0,'pep_seq',df['pep_seq'])
print(df2)
pep_seq AAC-T01 AAC-T02 AAB-T01 AAB-T02
0 0 0 1.0 NaN 4.0
1 4 0 -1.0 0.0 NaN
2 4 0 NaN 0.0 NaN
3 4 0 -3.0 0.0 NaN
I hope that I understand you correctly but numpy.where() should do it for you.
Have a look here: condition based subtraction
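For completeness, a numpy.where version of that idea might look like the sketch below. Note that the loop reads from the untouched frame (df2) and writes into a copy, so the T01 base columns are not zeroed out before the other columns are processed:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[0, 1, 2, np.nan, 4],
                   [4, 3, 2, 6, np.nan],
                   [4, 3, np.nan, 6, np.nan],
                   [4, 5, 2, 6, np.nan]],
                  columns=['pep_seq', 'AAC-T01', 'AAC-T02', 'AAB-T01', 'AAB-T02'])

df2 = df.drop(['pep_seq'], axis=1)
result = df2.copy()
for col in df2.columns:
    base = df2[col[:4] + 'T01']  # the person's T01 column, read from the original
    # subtract T01 where it exists; where T01 is NaN, keep the original value
    result[col] = np.where(base.isna(), df2[col], df2[col] - base)
result.insert(0, 'pep_seq', df['pep_seq'])
```

This keeps the 4.0 in AAB-T02 because its T01 base is NaN in that row.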

Counting Number of Occurrences Between Dates (Given an ID value) From Another Dataframe

Pandas: select DF rows based on another DF is the closest answer I can find to my question, but I don't believe it quite solves it.
Anyway, I am working with two very large pandas dataframes (so speed is a consideration), df_emails and df_trips, both of which are already sorted by CustID and then by date.
df_emails includes the date we sent a customer an email and it looks like this:
CustID DateSent
0 2 2018-01-20
1 2 2018-02-19
2 2 2018-03-31
3 4 2018-01-10
4 4 2018-02-26
5 5 2018-02-01
6 5 2018-02-07
df_trips includes the dates a customer came to the store and how much they spent, and it looks like this:
CustID TripDate TotalSpend
0 2 2018-02-04 25
1 2 2018-02-16 100
2 2 2018-02-22 250
3 4 2018-01-03 50
4 4 2018-02-28 100
5 4 2018-03-21 100
6 8 2018-01-07 200
Basically, what I need to do is find the number of trips and total spend for each customer in between each email sent. If it is the last time an email is sent for a given customer, I need to find the total number of trips and total spend after the email, but before the end of the data (2018-04-01). So the final dataframe would look like this:
CustID DateSent NextDateSentOrEndOfData TripsBetween TotalSpendBetween
0 2 2018-01-20 2018-02-19 2.0 125.0
1 2 2018-02-19 2018-03-31 1.0 250.0
2 2 2018-03-31 2018-04-01 0.0 0.0
3 4 2018-01-10 2018-02-26 0.0 0.0
4 4 2018-02-26 2018-04-01 2.0 200.0
5 5 2018-02-01 2018-02-07 0.0 0.0
6 5 2018-02-07 2018-04-01 0.0 0.0
Though I have tried my best to do this in a Python/Pandas friendly way, the only accurate solution I have been able to implement is through an np.where, shifting, and looping. The solution looks like this:
df_emails["CustNthVisit"] = df_emails.groupby("CustID").cumcount()+1
df_emails["CustTotalVisit"] = df_emails.groupby("CustID")["CustID"].transform('count')
df_emails["NextDateSentOrEndOfData"] = (
    pd.to_datetime(df_emails["DateSent"].shift(-1))
    .where(df_emails["CustNthVisit"] != df_emails["CustTotalVisit"],
           pd.to_datetime('04-01-2018')))
for i in df_emails.index:
    df_emails.at[i, "TripsBetween"] = len(
        df_trips[(df_trips["CustID"] == df_emails.at[i, "CustID"])
                 & (df_trips["TripDate"] > df_emails.at[i, "DateSent"])
                 & (df_trips["TripDate"] < df_emails.at[i, "NextDateSentOrEndOfData"])])
for i in df_emails.index:
    df_emails.at[i, "TotalSpendBetween"] = df_trips[
        (df_trips["CustID"] == df_emails.at[i, "CustID"])
        & (df_trips["TripDate"] > df_emails.at[i, "DateSent"])
        & (df_trips["TripDate"] < df_emails.at[i, "NextDateSentOrEndOfData"])
    ].TotalSpend.sum()
df_emails.drop(['CustNthVisit',"CustTotalVisit"], axis=1, inplace=True)
However, a %%timeit has revealed that this takes 10.6ms on just the seven rows shown above, which makes this solution pretty much infeasible on my actual datasets of about 1,000,000 rows. Does anyone know a solution here that is faster and thus feasible?
Add the next date column to emails
df_emails["NextDateSent"] = df_emails.groupby("CustID")["DateSent"].shift(-1)
Sort for merge_asof and then merge to nearest to create a trip lookup table
df_emails = df_emails.sort_values("DateSent")
df_trips = df_trips.sort_values("TripDate")
df_lookup = pd.merge_asof(df_trips, df_emails, by="CustID",
                          left_on="TripDate", right_on="DateSent",
                          direction="backward")
Aggregate the lookup table for the data you want.
df_lookup = (df_lookup.loc[:, ["CustID", "DateSent", "TotalSpend"]]
             .groupby(["CustID", "DateSent"])
             .agg(["count", "sum"]))
Left join it back to the email table.
df_merge = df_emails.join(df_lookup, on=["CustID", "DateSent"]).sort_values("CustID")
I chose to leave NaNs as NaNs because I don't like filling in default values. (You can always do that later if you prefer, but you can't easily distinguish between things that existed and things that didn't if you put defaults in early.)
CustID DateSent NextDateSent (TotalSpend, count) (TotalSpend, sum)
0 2 2018-01-20 2018-02-19 2.0 125.0
1 2 2018-02-19 2018-03-31 1.0 250.0
2 2 2018-03-31 NaT NaN NaN
3 4 2018-01-10 2018-02-26 NaN NaN
4 4 2018-02-26 NaT 2.0 200.0
5 5 2018-02-01 2018-02-07 NaN NaN
6 5 2018-02-07 NaT NaN NaN
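Put together end to end, and with the NaNs filled afterwards to match the desired output, the merge_asof approach above might look like the following sketch (the TripsBetween/TotalSpendBetween names are the OP's; note that with the default allow_exact_matches=True a trip on the exact DateSent would also be counted, which doesn't occur in this sample):

```python
import pandas as pd

df_emails = pd.DataFrame({
    "CustID": [2, 2, 2, 4, 4, 5, 5],
    "DateSent": pd.to_datetime(["2018-01-20", "2018-02-19", "2018-03-31",
                                "2018-01-10", "2018-02-26", "2018-02-01",
                                "2018-02-07"])})
df_trips = pd.DataFrame({
    "CustID": [2, 2, 2, 4, 4, 4, 8],
    "TripDate": pd.to_datetime(["2018-02-04", "2018-02-16", "2018-02-22",
                                "2018-01-03", "2018-02-28", "2018-03-21",
                                "2018-01-07"]),
    "TotalSpend": [25, 100, 250, 50, 100, 100, 200]})

# attribute each trip to the last email sent on or before it
df_lookup = pd.merge_asof(df_trips.sort_values("TripDate"),
                          df_emails.sort_values("DateSent"),
                          by="CustID", left_on="TripDate", right_on="DateSent",
                          direction="backward")

# count and sum the attributed trips per (customer, email)
agg = (df_lookup.dropna(subset=["DateSent"])
       .groupby(["CustID", "DateSent"])["TotalSpend"]
       .agg(TripsBetween="count", TotalSpendBetween="sum"))

# join back to the email table; emails that attracted no trips become 0
out = (df_emails.join(agg, on=["CustID", "DateSent"])
       .fillna({"TripsBetween": 0, "TotalSpendBetween": 0}))
```

This avoids the per-row loops entirely, so it should scale to the ~1,000,000-row case.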
This would be an easy case of merge_asof had I been able to handle the max_date, so I go a long way:
max_date = pd.to_datetime('2018-04-01')
# set_index for easy extraction by id
df_emails.set_index('CustID', inplace=True)
# we want this later in the final output
df_emails['NextDateSentOrEndOfData'] = df_emails.groupby('CustID')['DateSent'].shift(-1).fillna(max_date)
# cuts function for groupby
def cuts(df):
    custID = df.CustID.iloc[0]
    bins = list(df_emails.loc[[custID], 'DateSent']) + [max_date]
    return pd.cut(df.TripDate, bins=bins, right=False)
# bin the dates:
s = df_trips.groupby('CustID', as_index=False, group_keys=False).apply(cuts)
# aggregate the info:
new_df = (df_trips.groupby([df_trips.CustID, s])
          .TotalSpend.agg(['sum', 'size'])
          .reset_index()
          )
# get the right limit:
new_df['NextDateSentOrEndOfData'] = new_df.TripDate.apply(lambda x: x.right)
# drop the unnecessary info
new_df.drop('TripDate', axis=1, inplace=True)
# merge:
df_emails.reset_index().merge(new_df,
                              on=['CustID', 'NextDateSentOrEndOfData'],
                              how='left')
Output:
CustID DateSent NextDateSentOrEndOfData sum size
0 2 2018-01-20 2018-02-19 125.0 2.0
1 2 2018-02-19 2018-03-31 250.0 1.0
2 2 2018-03-31 2018-04-01 NaN NaN
3 4 2018-01-10 2018-02-26 NaN NaN
4 4 2018-02-26 2018-04-01 200.0 2.0
5 5 2018-02-01 2018-02-07 NaN NaN
6 5 2018-02-07 2018-04-01 NaN NaN

merging two dataframes together with similar column values [duplicate]

This question already has answers here:
Combine two pandas Data Frames (join on a common column)
(4 answers)
Closed 4 years ago.
I have two dfs, one is longer than the other, but they both have one column that contains the same values.
Here is my first df called weather:
DATE AWND PRCP SNOW WT01 WT02 TAVG
0 2017-01-01 5.59 0.00 0.0 NaN NaN 46
1 2017-01-02 9.17 0.21 0.0 1.0 NaN 40
2 2017-01-03 10.74 0.58 0.0 1.0 NaN 42
3 2017-01-04 8.05 0.00 0.0 1.0 NaN 47
4 2017-01-05 7.83 0.00 0.0 NaN NaN 34
Here is my 2nd df called bike:
DATE LENGTH ID AMOUNT
0 2017-01-01 3 1 5
1 2017-01-01 6 2 10
2 2017-01-02 9 3 100
3 2017-01-02 12 4 250
4 2017-01-03 15 5 45
So I want to copy the matching columns from the weather df into the bike df based upon the shared DATE column.
DATE LENGTH ID AMOUNT AWND SNOW TAVG
0 2017-01-01 3 1 5 5.59 0 46
1 2017-01-01 6 2 10 5.59 0 46
2 2017-01-02 9 3 100 9.17 0 40
3 2017-01-02 12 4 250 9.17 0 40
4 2017-01-03 15 5 45 10.74 0 42
Please help! Maybe some type of join can be used.
Use merge
In [93]: bike.merge(weather[['DATE', 'AWND', 'SNOW', 'TAVG']], on='DATE')
Out[93]:
DATE LENGTH ID AMOUNT AWND SNOW TAVG
0 2017-01-01 3 1 5 5.59 0.0 46
1 2017-01-01 6 2 10 5.59 0.0 46
2 2017-01-02 9 3 100 9.17 0.0 40
3 2017-01-02 12 4 250 9.17 0.0 40
4 2017-01-03 15 5 45 10.74 0.0 42
Just use the same indexes and simple slicing
df2 = df2.set_index('DATE')
df2[['SNOW', 'TAVG']] = df.set_index('DATE')[['SNOW', 'TAVG']]
If you check the pandas docs, they explain all the different types of "merges" (joins) that you can do between two dataframes.
The common syntax for a merge looks like: pd.merge(weather, bike, on= 'DATE')
You can also make the merge fancier by adding any of the arguments to your merge function that I listed below (e.g. specifying whether you want an inner vs. a right join):
Here are the arguments the function takes based on the current pandas docs:
pandas.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
Source
Hope it helps!
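As a quick illustration of the how argument (a toy sketch, not the OP's data): an inner join drops rows without a match, while a left join keeps every row of the left frame:

```python
import pandas as pd

weather = pd.DataFrame({"DATE": ["2017-01-01", "2017-01-02"], "TAVG": [46, 40]})
bike = pd.DataFrame({"DATE": ["2017-01-01", "2017-01-01", "2017-01-03"],
                     "ID": [1, 2, 3]})

inner = bike.merge(weather, on="DATE", how="inner")  # the 2017-01-03 row is dropped
left = bike.merge(weather, on="DATE", how="left")    # all rows kept; TAVG is NaN there
```

For the OP's case, where every bike date appears in weather, inner and left give the same result.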

Pandas fill in missing date within each group with information in the previous row

Similar question to this one, but with some modifications:
Instead of filling in missing dates for each group between the min and max date of the entire column, we should only be filling in the dates between the min and the max of that group, and output a dataframe with the last row in each group.
Reproducible example:
x = pd.DataFrame({'dt': ['2016-01-01', '2016-01-03', '2016-01-04',
                         '2016-01-01', '2016-01-01', '2016-01-04'],
                  'amount': [10.0, 30.0, 40.0, 78.0, 80.0, 82.0],
                  'sub_id': [1, 1, 1, 2, 2, 2]})
Visually:
dt sub_id amount
0 2016-01-01 1 10.0
1 2016-01-03 1 30.0
2 2016-01-04 1 40.0
3 2017-01-01 2 78.0
4 2017-01-01 2 80.0
5 2017-01-04 2 82.0
Output I need:
dt sub_id amount
0 2016-01-01 1 10.0
1 2016-01-02 1 10.0
2 2016-01-03 1 30.0
3 2016-01-04 1 40.0
4 2017-01-01 2 80.0
5 2017-01-02 2 80.0
6 2017-01-03 2 80.0
7 2017-01-04 2 82.0
We are grouping by dt and sub_id. As you can see, in sub_id=1, a row was added for 2016-01-02 and amount was imputed at 10.0 as the previous row was 10.0 (Assume data is sorted beforehand to enable this). For sub_id=2 row was added for 2017-01-02 and 2017-01-03 and amount is 80.0 as that was the last row before this date. The first row for 2017-01-01 was also deleted because we just want to keep the last row for each date and sub_id.
Looking for the most efficient way to do this as the real data has millions of rows. I have a current method using lambda functions and applying them across groups of sub_id but I feel like we could do better.
Thanks!
Getting the date right of course:
x.dt = pd.to_datetime(x.dt)
Then this:
cols = ['dt', 'sub_id']
pd.concat([
    d.asfreq('D').ffill(downcast='infer')
    for _, d in x.drop_duplicates(cols, keep='last')
                 .set_index('dt').groupby('sub_id')
]).reset_index()
dt amount sub_id
0 2016-01-01 10 1
1 2016-01-02 10 1
2 2016-01-03 30 1
3 2016-01-04 40 1
4 2016-01-01 80 2
5 2016-01-02 80 2
6 2016-01-03 80 2
7 2016-01-04 82 2
By using resample with groupby
x.dt = pd.to_datetime(x.dt)
x.set_index('dt').groupby('sub_id').apply(
    lambda x: x.resample('D').max().ffill()
).reset_index(level=1)
Out[265]:
dt amount sub_id
sub_id
1 2016-01-01 10.0 1.0
1 2016-01-02 10.0 1.0
1 2016-01-03 30.0 1.0
1 2016-01-04 40.0 1.0
2 2016-01-01 80.0 2.0
2 2016-01-02 80.0 2.0
2 2016-01-03 80.0 2.0
2 2016-01-04 82.0 2.0
use asfreq & groupby
first convert dt to datetime & get rid of duplicates
then for each group of sub_id use asfreq('D', method='ffill') to generate missing dates and impute amounts
finally reset_index on amount column as there's a duplicate sub_id column as well as index.
x.dt = pd.to_datetime(x.dt)
x.drop_duplicates(
    ['dt', 'sub_id'], 'last'
).groupby('sub_id').apply(
    lambda x: x.set_index('dt').asfreq('D', method='ffill')
).amount.reset_index()
# output:
sub_id dt amount
0 1 2016-01-01 10.0
1 1 2016-01-02 10.0
2 1 2016-01-03 30.0
3 1 2016-01-04 40.0
4 2 2016-01-01 80.0
5 2 2016-01-02 80.0
6 2 2016-01-03 80.0
7 2 2016-01-04 82.0
The below works for me and seems pretty efficient, but I can't say if it's efficient enough. It does avoid lambdas, though.
I called your data df.
Create a base_df with the entire date / sub_id grid:
import pandas as pd
from itertools import product
base_grid = product(
    pd.date_range(df['dt'].min(), df['dt'].max(), freq='D'),
    list(range(df['sub_id'].min(), df['sub_id'].max() + 1, 1)))
base_df = pd.DataFrame(list(base_grid), columns=['dt', 'sub_id'])
Get the max value per dt / sub_id from df:
max_value_df = df.loc[df.groupby(['dt', 'sub_id'])['amount'].idxmax()]
max_value_df['dt'] = max_value_df['dt'].apply(pd.Timestamp)
Merge base_df on the max values:
merged_df = base_df.merge(max_value_df, how='left', on=['dt', 'sub_id'])
Sort and forward fill the maximal value:
merged_df = merged_df.sort_values(by=['sub_id', 'dt', 'amount'], ascending=True)
merged_df['amount'] = merged_df.groupby(['sub_id'])['amount'].fillna(method='ffill')
Result:
dt sub_id amount
0 2016-01-01 1 10.0
2 2016-01-02 1 10.0
4 2016-01-03 1 30.0
6 2016-01-04 1 40.0
1 2016-01-01 2 80.0
3 2016-01-02 2 80.0
5 2016-01-03 2 80.0
7 2016-01-04 2 82.0

Fill missing dates

I have a dataframe that contains temperature readings from different areas on different dates.
I want to add the missing dates for each area with zero temperature,
for example:
df = pd.DataFrame({"area_id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   "reading_date": ["13/1/2017", "15/1/2017", "16/1/2017",
                                    "22/3/2017", "26/3/2017", "28/3/2017",
                                    "15/5/2017", "16/5/2017", "18/5/2017"],
                   "temp": [12, 15, 22, 6, 14, 8, 30, 25, 33]})
What is the most efficient way to fill the date gaps per area (with zeros) as shown below?
Many Thanks.
Use:
first convert column reading_date to datetime by to_datetime
set_index for a DatetimeIndex and groupby with resample
for the Series add asfreq
replace NaNs by fillna
last add reset_index to restore the columns from the MultiIndex
df['reading_date'] = pd.to_datetime(df['reading_date'], dayfirst=True)
df = (df.set_index('reading_date')
        .groupby('area_id')
        .resample('d')['temp']
        .asfreq()
        .fillna(0)
        .reset_index())
print (df)
area_id reading_date temp
0 1 2017-01-13 12.0
1 1 2017-01-14 0.0
2 1 2017-01-15 15.0
3 1 2017-01-16 22.0
4 2 2017-03-22 6.0
5 2 2017-03-23 0.0
6 2 2017-03-24 0.0
7 2 2017-03-25 0.0
8 2 2017-03-26 14.0
9 2 2017-03-27 0.0
10 2 2017-03-28 8.0
11 3 2017-05-15 30.0
12 3 2017-05-16 25.0
13 3 2017-05-17 0.0
14 3 2017-05-18 33.0
Using reindex. Define a custom function to handle the reindexing operation, and call it inside groupby.apply.
def reindex(x):
    # Thanks to @jezrael for the improvement.
    return x.reindex(pd.date_range(x.index.min(), x.index.max()), fill_value=0)
Next, convert reading_date to datetime using pd.to_datetime:
df.reading_date = pd.to_datetime(df.reading_date, dayfirst=True)
Now, perform a groupby.
df = (
    df.set_index('reading_date')
      .groupby('area_id')
      .temp
      .apply(reindex)
      .reset_index()
)
df.columns = ['area_id', 'reading_date', 'temp']
df
area_id reading_date temp
0 1 2017-01-13 12.0
1 1 2017-01-14 0.0
2 1 2017-01-15 15.0
3 1 2017-01-16 22.0
4 2 2017-03-22 6.0
5 2 2017-03-23 0.0
6 2 2017-03-24 0.0
7 2 2017-03-25 0.0
8 2 2017-03-26 14.0
9 2 2017-03-27 0.0
10 2 2017-03-28 8.0
11 3 2017-05-15 30.0
12 3 2017-05-16 25.0
13 3 2017-05-17 0.0
14 3 2017-05-18 33.0
