How to interpolate grouped time series in a Pandas dataframe - python

I have data of type pd.DataFrame which looks like the following:
type date sum
A Jan-1 1
A Jan-3 2
B Feb-1 1
B Feb-2 3
B Feb-5 6
The task is to build a continuous time series for each type (missing dates should be filled with 0).
The expected result is:
type date sum
A Jan-1 1
A Jan-2 0
A Jan-3 2
B Feb-1 1
B Feb-2 3
B Feb-3 0
B Feb-4 0
B Feb-5 6
Is it possible to do that with pandas or other Python tools?
The real dataset has millions of rows.

You first must convert your date column to datetime and move it into the index to take advantage of resampling; afterwards you can convert the dates back to their original format.
import pandas as pd

# change to datetime
df['date'] = pd.to_datetime(df.date, format="%b-%d")
df = df.set_index('date')
# resample to fill in missing dates
df1 = df.groupby('type').resample('d')['sum'].asfreq().fillna(0)
df1 = df1.reset_index()
# change back to original date format
df1['date'] = df1.date.dt.strftime('%b-%d')
output
type date sum
0 A Jan-01 1.0
1 A Jan-02 0.0
2 A Jan-03 2.0
3 B Feb-01 1.0
4 B Feb-02 3.0
5 B Feb-03 0.0
6 B Feb-04 0.0
7 B Feb-05 6.0
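For reference, here is a minimal self-contained sketch of the same approach; the input frame is rebuilt by hand from the sample above, so the literal values are only an illustration:
import pandas as pd

# rebuild the sample frame from the question
df = pd.DataFrame({'type': ['A', 'A', 'B', 'B', 'B'],
                   'date': ['Jan-1', 'Jan-3', 'Feb-1', 'Feb-2', 'Feb-5'],
                   'sum': [1, 2, 1, 3, 6]})

# parse the month-day strings (the year defaults to 1900, which is fine for gap filling)
df['date'] = pd.to_datetime(df.date, format='%b-%d')
df = df.set_index('date')

# resample per type so every day between each type's first and last date appears,
# filling the newly created days with 0
df1 = df.groupby('type').resample('d')['sum'].asfreq().fillna(0).reset_index()
df1['date'] = df1.date.dt.strftime('%b-%d')
print(df1)
The whole pipeline is vectorised (to_datetime, groupby, resample), so it should remain workable on millions of rows, though memory use grows with the number of days each type spans.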

Related

Python pandas to filter the data based on date range in ascending order

I'm loading a CSV file that has three columns: a column with date and time, a column with a value, and another column 'data'. Example rows:
value data Date-Time
0 2 a 2019-3-18 23:11:00
1 3 b 2019-10-24 21:00:12
2 1 c 2019-1-10 23:00:00
3 2 d 2019-4-18 23:11:00
4 1 e 2019-1-1 23:00:00
I want to group by value; if there are duplicates on value, I need to keep the record with the most recent date and time. It should look as follows:
value data date
0 1 c 2019-1-10 23:00:00
1 2 d 2019-04-18 23:11:00
2 3 b 2019-10-24 21:00:12
df["date"] = pd.to_datetime(df["date"])
df = df.sort_values("date").groupby(['value'], as_index=False).first()
print(df)
Use sort_values and drop_duplicates:
# Convert 'Date-Time' column to datetime64
# df['Date-Time'] = pd.to_datetime(df['Date-Time'])
>>> df.sort_values('Date-Time') \
.drop_duplicates('value', keep='last') \
.sort_values('value')
value data Date-Time
2 1 c 2019-01-10 23:00:00
3 2 d 2019-04-18 23:11:00
1 3 b 2019-10-24 21:00:12
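A self-contained sketch of the drop_duplicates approach, with the sample rows reconstructed by hand (so the literal values are assumptions taken from the question):
import pandas as pd

df = pd.DataFrame({'value': [2, 3, 1, 2, 1],
                   'data': ['a', 'b', 'c', 'd', 'e'],
                   'Date-Time': ['2019-3-18 23:11:00', '2019-10-24 21:00:12',
                                 '2019-1-10 23:00:00', '2019-4-18 23:11:00',
                                 '2019-1-1 23:00:00']})
df['Date-Time'] = pd.to_datetime(df['Date-Time'])

# sort chronologically, keep the most recent row per value, then order by value
out = (df.sort_values('Date-Time')
         .drop_duplicates('value', keep='last')
         .sort_values('value')
         .reset_index(drop=True))
print(out)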

Expanding multi-indexed dataframe with new dates as forecast

Note: I have followed Stack Overflow's instructions on how to create an MRE and pasted it into a code block as instructed (i.e. paste it in the body and press Ctrl+K while it is highlighted). If I am still not doing it correctly, let me know.
Back to the question: suppose I now have a df multi-indexed on both the date (df['DT']) and ID (df['ID']):
DT,ID,value1,value2
2020-10-01,a,1,1
2020-10-01,b,2,1
2020-10-01,c,3,1
2020-10-01,d,4,1
2020-10-02,a,10,1
2020-10-02,b,11,1
2020-10-02,c,12,1
2020-10-02,d,13,1
df = df.set_index(['DT','ID'])
Now I want to expand the df to include '2020-10-03' and '2020-10-04' with the same set of IDs {a,b,c,d} as my forecast period. To forecast value1, I assume each ID takes the average of its existing values, e.g. for a's value1 on both '2020-10-03' and '2020-10-04' I assume it will take (1+10)/2 = 5.5. For value2, I assume it stays constant at 1.
The expected df will look like this:
DT,ID,value1,value2
2020-10-01,a,1.0,1
2020-10-01,b,2.0,1
2020-10-01,c,3.0,1
2020-10-01,d,4.0,1
2020-10-02,a,10.0,1
2020-10-02,b,11.0,1
2020-10-02,c,12.0,1
2020-10-02,d,13.0,1
2020-10-03,a,5.5,1
2020-10-03,b,6.5,1
2020-10-03,c,7.5,1
2020-10-03,d,8.5,1
2020-10-04,a,5.5,1
2020-10-04,b,6.5,1
2020-10-04,c,7.5,1
2020-10-04,d,8.5,1
Appreciate your help and time.
For an easy mean-based forecast, use DataFrame.unstack to get a DatetimeIndex, add the next dates with DataFrame.reindex and date_range, then fill the missing values in the value1 level with DataFrame.fillna and set value2 to 1; finally reshape back with DataFrame.stack:
print (df)
value1 value2
DT ID
2020-10-01 a 1 1
b 2 1
c 3 1
d 4 1
2020-10-02 a 10 1
b 11 1
c 12 1
d 13 1
rng = pd.date_range('2020-10-01','2020-10-04', name='DT')
df1 = df.unstack().reindex(rng)
df1['value1'] = df1['value1'].fillna(df1['value1'].mean())
df1['value2'] = 1
df2 = df1.stack()
print (df2)
value1 value2
DT ID
2020-10-01 a 1.0 1
b 2.0 1
c 3.0 1
d 4.0 1
2020-10-02 a 10.0 1
b 11.0 1
c 12.0 1
d 13.0 1
2020-10-03 a 5.5 1
b 6.5 1
c 7.5 1
d 8.5 1
2020-10-04 a 5.5 1
b 6.5 1
c 7.5 1
d 8.5 1
But forecasting in general is more complex; you can check this.
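A minimal runnable sketch of those steps, with the question's frame rebuilt by hand:
import pandas as pd

# rebuild the question's frame
df = pd.DataFrame({'DT': pd.to_datetime(['2020-10-01'] * 4 + ['2020-10-02'] * 4),
                   'ID': ['a', 'b', 'c', 'd'] * 2,
                   'value1': [1, 2, 3, 4, 10, 11, 12, 13],
                   'value2': 1}).set_index(['DT', 'ID'])

# unstack ID into columns, then extend the date index to the forecast period
rng = pd.date_range('2020-10-01', '2020-10-04', name='DT')
df1 = df.unstack().reindex(rng)

# fill the forecast rows: value1 gets each ID's historical mean, value2 stays 1
df1['value1'] = df1['value1'].fillna(df1['value1'].mean())
df1['value2'] = 1
df2 = df1.stack()
print(df2)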

Add pandas Series as new columns to a specific Dataframe row

Say I have a Dataframe
df = pd.DataFrame({'A':[0,1],'B':[2,3]})
A B
0 0 2
1 1 3
Then I have a Series, generated by some other function using inputs from the first row of the df, whose keys have no overlap with the existing df's columns:
s = pd.Series({'C': 4, 'D': 6})
C 4
D 6
Now I want to add s to df.loc[0] with the keys becoming new columns and the values added only to this first row. The end result for df should look like:
A B C D
0 0 2 4 6
1 1 3 NaN NaN
How would I do that? Similar questions I've found only seem to look at doing this for one column or just adding the Series as a new row at the end of the DataFrame but not updating an existing row by adding multiple new columns from a Series.
I've tried df.loc[0,list(['C','D'])] = [4,6] which was suggested in another answer but that only works if ['C','D'] are already existing columns in the Dataframe. df.assign(**s) works but then assigns the Series values to all rows.
Use join with a transposed one-row DataFrame:
df.join(pd.DataFrame(s).T)
A B C D
0 0 2 4.0 6.0
1 1 3 NaN NaN
Or use concat
pd.concat([df, pd.DataFrame(s).T], axis=1)
A B C D
0 0 2 4.0 6.0
1 1 3 NaN NaN
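Both answers above boil down to the same trick; a small self-contained sketch:
import pandas as pd

df = pd.DataFrame({'A': [0, 1], 'B': [2, 3]})
s = pd.Series({'C': 4, 'D': 6})

# pd.DataFrame(s).T turns the Series into a one-row DataFrame with index 0,
# so join/concat align it with row 0 of df and leave NaN in the other rows
out_join = df.join(pd.DataFrame(s).T)
out_concat = pd.concat([df, pd.DataFrame(s).T], axis=1)

print(out_join)
print(out_concat)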

Pandas MultiIndex Aggregation

I am trying to do some aggregation on a multi-index DataFrame based on a DatetimeIndex generated from pandas.date_range.
My DatetimeIndex looks like this:
DatetimeIndex(['2000-05-30', '2000-05-31', '2000-06-01' ... '2001-1-31'])
And my multi-index DataFrame looks like this:
value
date id
2000-05-31 1 0
2 1
3 1
2000-06-30 2 1
3 0
4 0
2000-07-30 2 1
4 0
1 0
2002-09-30 1 1
3 1
The dates in the DatetimeIndex may or may not be in the date index.
I need to retrieve all the ids such that the percentage of value==1 is greater than or equal to some decimal threshold, e.g. 0.6, considering only the rows whose date for that id is in the DatetimeIndex.
For example if the threshold is 0.5, then the output should be [2, 3] or some DataFrame containing 2 and 3.
1 does not meet the requirement because 2002-09-30 is not in the DatetimeIndex.
I have a solution with loops and dictionaries to keep track of how often value==1 for each id, but it runs very slowly.
How can I utilize pandas to perform this aggregation?
Thank you.
You can use:
#define range
rng = pd.date_range('2000-05-30', '2000-7-01')
#filtering with isin
df = df[df.index.get_level_values('date').isin(rng)]
#get the mean of value per id (the fraction of rows where value == 1)
s = df.groupby('id')['value'].mean()
print (s)
id
1 0.0
2 1.0
3 0.5
4 0.0
Name: value, dtype: float64
#get all index values meeting the threshold
a = s.index[s >= 0.5].tolist()
print (a)
[2, 3]
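Putting it together as a self-contained sketch (the frame is rebuilt by hand from the question, so the exact values are an assumption):
import pandas as pd

dates = pd.to_datetime(['2000-05-31'] * 3 + ['2000-06-30'] * 3 +
                       ['2000-07-30'] * 3 + ['2002-09-30'] * 2)
ids = [1, 2, 3, 2, 3, 4, 2, 4, 1, 1, 3]
values = [0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1]
df = pd.DataFrame({'value': values},
                  index=pd.MultiIndex.from_arrays([dates, ids], names=['date', 'id']))

# keep only the rows whose date is inside the range, then take the mean per id
rng = pd.date_range('2000-05-30', '2000-7-01')
df = df[df.index.get_level_values('date').isin(rng)]
s = df.groupby('id')['value'].mean()
print(s.index[s >= 0.5].tolist())   # [2, 3]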

pandas pivot table for heatmap

I am trying to generate a heatmap using seaborn, however I am having a small problem with the formatting of my data.
Currently, my data is in the form:
Name Diag Date
A 1 2006-12-01
A 1 1994-02-12
A 2 2001-07-23
B 2 1999-09-12
B 1 2016-10-12
C 3 2010-01-20
C 2 1998-08-20
I would like to create a heatmap (preferably in Python) showing Name on one axis against Diag, indicating whether it occurred. I have tried to pivot the table using pd.pivot, however I was given the error
ValueError: Index contains duplicate entries, cannot reshape
this came from:
piv = df.pivot_table(index='Name',columns='Diag')
Time is irrelevant, but I would like to show which Names have had which Diag, and which Diag combos cluster together. Do I need to create a new table for this, or is it possible with the data I have? In some cases a Name is not associated with every Diag.
EDIT:
I have since tried:
piv = df.pivot_table(index='Name',columns='Diag', values='Time', aggfunc='mean')
However as Time is in datetime format, I end up with:
pandas.core.base.DataError: No numeric types to aggregate
You need pivot_table with some aggregate function, because the same index/column pair has multiple values and pivot needs unique values only:
print (df)
Name Diag Time
0 A 1 12 <- duplicates for the same (A, 1) pair with different values
1 A 1 13 <- duplicates for the same (A, 1) pair with different values
2 A 2 14
3 B 2 18
4 B 1 1
5 C 3 9
6 C 2 8
df = df.pivot_table(index='Name',columns='Diag', values='Time', aggfunc='mean')
print (df)
Diag 1 2 3
Name
A 12.5 14.0 NaN
B 1.0 18.0 NaN
C NaN 8.0 9.0
Alternative solution:
df = df.groupby(['Name','Diag'])['Time'].mean().unstack()
print (df)
Diag 1 2 3
Name
A 12.5 14.0 NaN
B 1.0 18.0 NaN
C NaN 8.0 9.0
EDIT:
You can also check all the duplicates with duplicated:
df = df.loc[df.duplicated(['Name','Diag'], keep=False), ['Name','Diag']]
print (df)
Name Diag
0 A 1
1 A 1
EDIT:
Taking the mean of datetimes is not easy: you need to convert the dates to nanoseconds, take the mean, and finally convert back to datetimes. There is another problem as well: NaN has to be replaced by some scalar, e.g. 0, which is converted to the epoch datetime 1970-01-01.
import numpy as np

df.Date = pd.to_datetime(df.Date)
df['dates_in_ns'] = pd.Series(df.Date.values.astype(np.int64), index=df.index)
df = df.pivot_table(index='Name',
                    columns='Diag',
                    values='dates_in_ns',
                    aggfunc='mean',
                    fill_value=0)
df = df.apply(pd.to_datetime)
print (df)
Diag 1 2 3
Name
A 2000-07-07 12:00:00 2001-07-23 1970-01-01
B 2016-10-12 00:00:00 1999-09-12 1970-01-01
C 1970-01-01 00:00:00 1998-08-20 2010-01-20
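Since the original goal was a seaborn heatmap of which Names have had which Diag, a count-based pivot is usually what gets plotted; a minimal sketch, where pd.crosstab is used as an assumed stand-in for the aggregation above:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({'Name': ['A', 'A', 'A', 'B', 'B', 'C', 'C'],
                   'Diag': [1, 1, 2, 2, 1, 3, 2]})

# count how many times each Name received each Diag (0 where it never occurred)
piv = pd.crosstab(df['Name'], df['Diag'])

sns.heatmap(piv, annot=True, fmt='d', cmap='Blues')
plt.show()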
