Assume we have the following data frame:
import pandas as pd

# data
t = pd.to_datetime(pd.Series(['2015-01-01', '2015-02-01', '2015-03-01', '2015-04-01', '2015-01-01', '2015-02-01']))
g = pd.Series(['A', 'A', 'A', 'A', 'B', 'B'])
v = pd.Series([12.1, 14.2, 15.3, 16.2, 12.2, 13.7])
df = pd.DataFrame({'time': t, 'group': g, 'value': v})
# show data
>>> df
time group value
0 2015-01-01 A 12.1
1 2015-02-01 A 14.2
2 2015-03-01 A 15.3
3 2015-04-01 A 16.2
4 2015-01-01 B 12.2
5 2015-02-01 B 13.7
What I would like to have in the end is the following data frame:
>>> df
time group value
0 2015-01-01 A 12.1
1 2015-02-01 A 14.2
2 2015-03-01 A 15.3
3 2015-04-01 A 16.2
4 2015-01-01 B 12.2
5 2015-02-01 B 13.7
6 2015-03-01 B 13.7
7 2015-04-01 B 13.7
The missing observations in group B should be added and the missing values should default to the last observed value.
How can I achieve this? Thanks in advance!
You can use pivot to reshape, forward fill the NaN values with ffill, and reshape back to the original layout with unstack and reset_index:
print (df.pivot(index='time',columns='group',values='value')
.ffill()
.unstack()
.reset_index(name='value'))
group time value
0 A 2015-01-01 12.1
1 A 2015-02-01 14.2
2 A 2015-03-01 15.3
3 A 2015-04-01 16.2
4 B 2015-01-01 12.2
5 B 2015-02-01 13.7
6 B 2015-03-01 13.7
7 B 2015-04-01 13.7
Another solution: first build a date_range from the min and max values of time, then groupby and reindex each group by it with ffill:
Note:
I think you forgot the parameter format='%Y-%d-%m' in to_datetime, if the last number is the month:
t = pd.to_datetime(pd.Series(['2015-01-01', '2015-02-01', '2015-03-01',
'2015-04-01', '2015-01-01', '2015-02-01']),
format='%Y-%d-%m')
idx = pd.date_range(df.time.min(), df.time.max())
print (idx)
DatetimeIndex(['2015-01-01', '2015-01-02', '2015-01-03', '2015-01-04'],
dtype='datetime64[ns]', freq='D')
df1 = (df.groupby('group')
.apply(lambda x: x.set_index('time')
.reindex(idx))
.ffill()
.reset_index(level=0, drop=True)
.reset_index()
.rename(columns={'index':'time'}))
print (df1)
time group value
0 2015-01-01 A 12.1
1 2015-01-02 A 14.2
2 2015-01-03 A 15.3
3 2015-01-04 A 16.2
4 2015-01-01 B 12.2
5 2015-01-02 B 13.7
6 2015-01-03 B 13.7
7 2015-01-04 B 13.7
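If the dates really are month starts in '%Y-%m-%d' as the question's desired output suggests, a minimal sketch of the same reindex idea with a monthly range should give exactly that output; freq='MS' is my assumption about the intended frequency:
idx = pd.date_range(df.time.min(), df.time.max(), freq='MS')
df1 = (df.groupby('group')
       .apply(lambda x: x.set_index('time')
       .reindex(idx))
       .ffill()
       .reset_index(level=0, drop=True)
       .reset_index()
       .rename(columns={'index':'time'}))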
I have the following dataframe:
import pandas as pd
dt = pd.DataFrame({'start_date': ['2019-05-20', '2019-05-21', '2019-05-21'],
'end_date': ['2019-05-23', '2019-05-24', '2019-05-22'],
'reg': ['A', 'B','A'],
'measure': [100, 200,1000]})
I would like to create a new column, called 'date', which will have values from start_date until end_date, and also a new column measure_daily which will be the measure spread equally among these dates.
So basically, I would like to expand dt in terms of rows.
So I would like the final df to look like:
dt_f = pd.DataFrame({'date':['2019-05-20','2019-05-21','2019-05-22','2019-05-23','2019-05-21','2019-05-22','2019-05-23','2019-05-24', '2019-05-21','2019-05-22'],
'reg':['A','A','A','A','B','B','B','B','A','A'],
'measure_daily':[25,25,25,25,50,50,50,50,500,500]})
Is there an efficient way to do this in Python?
TL;DR
just give me the solution:
dt = dt.assign(key=dt.index)
melt = dt.melt(id_vars = ['reg', 'measure', 'key'], value_name='date').drop('variable', axis=1)
melt['date'] = pd.to_datetime(melt['date'])  # resample needs a DatetimeIndex
melt = pd.concat(
    [d.set_index('date').resample('d').first().ffill() for _, d in melt.groupby(['reg', 'key'], sort=False)]
).reset_index()
melt.assign(measure = melt['measure'].div(melt.groupby(['reg', 'key'], sort=False)['reg'].transform('size'))).drop('key', axis=1)
Breakdown:
First we melt your start and end date into a single column and convert it to datetime (the resample step below needs a DatetimeIndex):
dt = dt.assign(key=dt.index)
melt = dt.melt(id_vars = ['reg', 'measure', 'key'], value_name='date').drop('variable', axis=1)
melt['date'] = pd.to_datetime(melt['date'])
reg measure key date
0 A 100 0 2019-05-20
1 B 200 1 2019-05-21
2 A 1000 2 2019-05-21
3 A 100 0 2019-05-23
4 B 200 1 2019-05-24
5 A 1000 2 2019-05-22
Then we resample on a daily basis, grouping by reg and key so that each original interval stays in its own group.
melt = pd.concat(
[d.set_index('date').resample('d').first().ffill() for _, d in melt.groupby(['reg', 'key'], sort=False)]
).reset_index()
date reg measure key
0 2019-05-20 A 100.0 0.0
1 2019-05-21 A 100.0 0.0
2 2019-05-22 A 100.0 0.0
3 2019-05-23 A 100.0 0.0
4 2019-05-21 B 200.0 1.0
5 2019-05-22 B 200.0 1.0
6 2019-05-23 B 200.0 1.0
7 2019-05-24 B 200.0 1.0
8 2019-05-21 A 1000.0 2.0
9 2019-05-22 A 1000.0 2.0
Finally we spread out the measure column over the size of each group with assign:
melt.assign(measure = melt['measure'].div(melt.groupby(['reg', 'key'], sort=False)['reg'].transform('size'))).drop('key', axis=1)
date reg measure
0 2019-05-20 A 25.0
1 2019-05-21 A 25.0
2 2019-05-22 A 25.0
3 2019-05-23 A 25.0
4 2019-05-21 B 50.0
5 2019-05-22 B 50.0
6 2019-05-23 B 50.0
7 2019-05-24 B 50.0
8 2019-05-21 A 500.0
9 2019-05-22 A 500.0
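As a side note, a shorter sketch of my own (not the approach broken down above, and assuming pandas >= 0.25 for explode) builds the dates per row with pd.date_range and then explodes them; column names are those of the question:
dt2 = dt.copy()
dt2['date'] = [pd.date_range(s, e) for s, e in zip(dt2['start_date'], dt2['end_date'])]
dt2 = dt2.explode('date')
# spread the measure over the number of days of each original row
dt2['measure_daily'] = dt2['measure'] / dt2.groupby(level=0)['measure'].transform('size')
print (dt2[['date', 'reg', 'measure_daily']].reset_index(drop=True))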
I have a pandas dataframe that looks like this:
KEY START END VALUE
0 A 2017-01-01 2017-01-16 2.1
1 B 2017-01-01 2017-01-23 4.3
2 B 2017-01-23 2017-02-10 1.7
3 A 2017-01-28 2017-02-02 4.2
4 A 2017-02-02 2017-03-01 0.8
I would like to groupby on KEY and sum on VALUE but only on continuous periods of time. For instance in the above example I would like to get:
KEY START END VALUE
0 A 2017-01-01 2017-01-16 2.1
1 A 2017-01-28 2017-03-01 5.0
2 B 2017-01-01 2017-02-10 6.0
There are two groups for A since there is a gap in the time periods.
I would like to avoid for loops since the dataframe has tens of millions of rows.
Create a helper Series by comparing the shifted START column per group, and use it for groupby:
s = df.loc[df.groupby('KEY')['START'].shift(-1) == df['END'], 'END']
s = s.combine_first(df['START'])
print (s)
0 2017-01-01
1 2017-01-23
2 2017-01-23
3 2017-02-02
4 2017-02-02
Name: END, dtype: datetime64[ns]
df = df.groupby(['KEY', s], as_index=False).agg({'START':'first','END':'last','VALUE':'sum'})
print (df)
KEY VALUE START END
0 A 2.1 2017-01-01 2017-01-16
1 A 5.0 2017-01-28 2017-03-01
2 B 6.0 2017-01-01 2017-02-10
The answer from jezrael works like a charm if there are only two consecutive rows to aggregate. In the new example, it would not aggregate the last three rows for KEY = A.
KEY START END VALUE
0 A 2017-01-01 2017-01-16 2.1
1 B 2017-01-01 2017-01-23 4.3
2 B 2017-01-23 2017-02-10 1.7
3 A 2017-01-28 2017-02-02 4.2
4 A 2017-02-02 2017-03-01 0.8
5 A 2017-03-01 2017-03-23 1.0
The following solution (a slight modification of jezrael's) makes it possible to aggregate all rows that should be aggregated:
df = df.sort_values(by='START')
idx = df.groupby('KEY')['START'].shift(-1) != df['END']
df['DATE'] = df.loc[idx, 'START']
df['DATE'] = df.groupby('KEY').DATE.fillna(method='backfill')
df = (df.groupby(['KEY', 'DATE'], as_index=False)
.agg({'START': 'first', 'END': 'last', 'VALUE': 'sum'})
.drop(['DATE'], axis=1))
Which gives:
KEY START END VALUE
0 A 2017-01-01 2017-01-16 2.1
1 A 2017-01-28 2017-03-23 6.0
2 B 2017-01-01 2017-02-10 6.0
Thanks #jezrael for the elegant approach!
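For reference, a sketch of my own of an equivalent grouping that labels each continuous run with a cumulative sum of gap flags (column names as in the example data):
df = df.sort_values(['KEY', 'START'])
# 1 where a row does not continue the previous row of the same KEY
gap = (df['START'] != df.groupby('KEY')['END'].shift()).astype(int)
df['BLOCK'] = gap.groupby(df['KEY']).cumsum()
df = (df.groupby(['KEY', 'BLOCK'], as_index=False)
        .agg({'START': 'first', 'END': 'last', 'VALUE': 'sum'})
        .drop('BLOCK', axis=1))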
I'm trying to reindex a dataframe relative to the second level of an index. I have a dataframe where the first level of the index is user id and the second level is date. For example:
import numpy as np
import pandas as pd

pd.DataFrame({
'id': 3*['A'] + 5*['B'] + 4*['C'],
'date': ['01-01-2010', '02-01-2010', '12-01-2010',
'04-01-2015', '05-01-2015', '03-01-2016', '04-01-2016', '05-01-2016',
'01-01-2015', '02-01-2015', '03-01-2015', '04-01-2015'],
'value': np.random.randint(10,100, 12)})\
.set_index(['id', 'date'])
I want to reindex the dates to fill in the missing dates, but only for the dates between the max and min dates for each "id" group.
For example user "A" should have continuous monthly data from January to December 2010 and user "B" should have continuous dates between April 2015 through May 2016. For simplicity let's assume I want to fill the NaNs with zeros.
Other questions similar to this assume that I want to use the same date_range for all users, which doesn't work in this use case. Any ideas?
I think you need reset_index + groupby + resample + asfreq + fillna:
np.random.seed(123)
df = pd.DataFrame({
'id': 3*['A'] + 5*['B'] + 4*['C'],
'date': ['01-01-2010', '02-01-2010', '12-01-2010',
'04-01-2015', '05-01-2015', '03-01-2016', '04-01-2016', '05-01-2016',
'01-01-2015', '02-01-2015', '03-01-2015', '04-01-2015'],
'value': np.random.randint(10,100, 12)})
df['date'] = pd.to_datetime(df['date'])
df = df.set_index(['id', 'date'])
print (df)
value
id date
A 2010-01-01 76
2010-02-01 27
2010-12-01 93
B 2015-04-01 67
2015-05-01 96
2016-03-01 57
2016-04-01 83
2016-05-01 42
C 2015-01-01 56
2015-02-01 35
2015-03-01 93
2015-04-01 88
df1 = df.reset_index(level='id').groupby('id')['value'].resample('D').asfreq().fillna(0)
print (df1.head(10))
value
id date
A 2010-01-01 76.0
2010-01-02 0.0
2010-01-03 0.0
2010-01-04 0.0
2010-01-05 0.0
2010-01-06 0.0
2010-01-07 0.0
2010-01-08 0.0
2010-01-09 0.0
2010-01-10 0.0
But if you need to process only the span between the min and max dates per group, first select those rows with agg by idxmin and idxmax and loc:
df = df.reset_index()
df1 = df.loc[df.groupby('id')['date'].agg(['idxmin', 'idxmax']).stack()]
print (df1)
id date value
0 A 2010-01-01 76
2 A 2010-12-01 93
3 B 2015-04-01 67
7 B 2016-05-01 42
8 C 2015-01-01 56
11 C 2015-04-01 88
df1 = df1.set_index('date').groupby('id')['value'].resample('MS').asfreq().fillna(0)
print (df1.head(10))
Is that what you want?
In [52]: (df.reset_index().groupby('id')
...: .apply(lambda x: x.set_index('date').resample('D').mean().fillna(0))
...: )
Out[52]:
value
id date
A 2010-01-01 91.0
2010-01-02 0.0
2010-01-03 0.0
2010-01-04 0.0
2010-01-05 0.0
2010-01-06 0.0
2010-01-07 0.0
2010-01-08 0.0
2010-01-09 0.0
2010-01-10 0.0
... ...
C 2015-03-23 0.0
2015-03-24 0.0
2015-03-25 0.0
2015-03-26 0.0
2015-03-27 0.0
2015-03-28 0.0
2015-03-29 0.0
2015-03-30 0.0
2015-03-31 0.0
2015-04-01 11.0
[823 rows x 1 columns]
PS: I have converted date to datetime dtype first.
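A minimal sketch of that conversion, done before setting the index (column names as in the question):
df['date'] = pd.to_datetime(df['date'])
df = df.set_index(['id', 'date'])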
Use groupby and agg to get the 'start' and 'end' dates per id, then build a list of (id, date) tuples to reindex with.
m = dict(min='start', max='end')
d2 = df.reset_index().groupby('id').date.agg(['min', 'max']).rename(columns=m)
idx = [(i, d) for i, row in d2.iterrows() for d in pd.date_range(freq='MS', **row)]
df.reindex(idx, fill_value=0)
value
id date
A 2010-01-01 27
2010-02-01 15
2010-03-01 0
2010-04-01 0
2010-05-01 0
2010-06-01 0
2010-07-01 0
2010-08-01 0
2010-09-01 0
2010-10-01 0
2010-11-01 0
2010-12-01 11
B 2015-04-01 10
2015-05-01 94
2015-06-01 0
2015-07-01 0
2015-08-01 0
2015-09-01 0
2015-10-01 0
2015-11-01 0
2015-12-01 0
2016-01-01 0
2016-02-01 0
2016-03-01 42
2016-04-01 15
2016-05-01 71
C 2015-01-01 17
2015-02-01 51
2015-03-01 99
2015-04-01 58
I have the following pandas dataframe:
import numpy as np
import pandas as pd
dfw = pd.DataFrame({"id": ["A", "B"],
"start_date": pd.to_datetime(["2012-01-01", "2013-02-13"], format="%Y-%m-%d"),
"end_date": pd.to_datetime(["2012-04-17", "2014-11-18"], format="%Y-%m-%d")})
Result:
end_date id start_date
2012-04-17 A 2012-01-01
2014-11-18 B 2013-02-13
I am looking for the most efficient way to transform this dataframe to the following dataframe:
dates = np.empty(0, dtype="datetime64[M]")
dates = np.append(dates, pd.date_range(start="2012-01-01", end="2012-06-01", freq="MS").astype("object"))
dates = np.append(dates, pd.date_range(start="2013-02-01", end="2014-12-01", freq="MS").astype("object"))
dfl = pd.DataFrame({"id": np.repeat(["A", "B"], [6, 23]),
"counter": np.concatenate((np.arange(0, 6, dtype="float"), np.arange(0, 23, dtype="float"))),
"date": pd.to_datetime(dates, format="%Y-%m-%d")})
Result:
counter date id
0.0 2012-01-01 A
1.0 2012-02-01 A
2.0 2012-03-01 A
3.0 2012-04-01 A
4.0 2012-05-01 A
0.0 2013-02-01 B
1.0 2013-03-01 B
2.0 2013-04-01 B
3.0 2013-05-01 B
4.0 2013-06-01 B
5.0 2013-07-01 B
6.0 2013-08-01 B
7.0 2013-09-01 B
8.0 2013-10-01 B
9.0 2013-11-01 B
10.0 2013-12-01 B
11.0 2014-01-01 B
12.0 2014-02-01 B
13.0 2014-03-01 B
14.0 2014-04-01 B
15.0 2014-05-01 B
16.0 2014-06-01 B
17.0 2014-07-01 B
18.0 2014-08-01 B
19.0 2014-09-01 B
20.0 2014-10-01 B
21.0 2014-11-01 B
22.0 2014-12-01 B
A naive solution I came up with so far is the following function:
def expand(df):
    dates = np.empty(0, dtype="datetime64[ns]")
    ids = np.empty(0, dtype="object")
    counter = np.empty(0, dtype="float")
    for name, group in df.groupby("id"):
        start_date = group["start_date"].min()
        start_date = pd.to_datetime(np.array(start_date, dtype="datetime64[M]").tolist())
        end_date = group["end_date"].min()
        end_date = end_date + pd.Timedelta(1, unit="M")
        end_date = pd.to_datetime(np.array(end_date, dtype="datetime64[M]").tolist())
        tmp = pd.date_range(start=start_date, end=end_date, freq="MS", closed=None).values
        dates = np.append(dates, tmp)
        ids = np.append(ids, np.repeat(group.id.values[0], len(tmp)))
        counter = np.append(counter, np.arange(0, len(tmp)))
    dfl = pd.DataFrame({"id": ids, "counter": counter, "date": dates})
    return dfl
But it is not very fast:
%timeit expand(dfw)
100 loops, best of 3: 4.84 ms per loop
Normally I advise avoiding itertuples, but in some situations it can be more intuitive. You can get fine-grained control of the endpoints via kwargs to pd.date_range if desired (e.g. whether or not to include an endpoint).
In [27]: result = pd.concat([pd.Series(r.id,pd.date_range(r.start_date, r.end_date)) for r in dfw.itertuples()]).reset_index()
In [28]: result.columns = ['date', 'id']
In [29]: result
Out[29]:
date id
0 2012-01-01 A
1 2012-01-02 A
2 2012-01-03 A
3 2012-01-04 A
4 2012-01-05 A
5 2012-01-06 A
.. ... ...
746 2014-11-13 B
747 2014-11-14 B
748 2014-11-15 B
749 2014-11-16 B
750 2014-11-17 B
751 2014-11-18 B
[752 rows x 2 columns]
In [26]: %timeit pd.concat([pd.Series(r.id,pd.date_range(r.start_date, r.end_date)) for r in dfw.itertuples()]).reset_index()
100 loops, best of 3: 2.15 ms per loop
Not really sure of the purpose of making this super fast. You would generally do this kind of expansion a single time.
You wanted month starts, so here is that.
In [23]: result = pd.concat([pd.Series(r.id,pd.date_range(r.start_date, r.end_date+pd.offsets.MonthBegin(1), freq='MS', closed=None)) for r in dfw.itertuples()]).reset_index()
In [24]: result.columns = ['date', 'id']
In [25]: result
Out[25]:
date id
0 2012-01-01 A
1 2012-02-01 A
2 2012-03-01 A
3 2012-04-01 A
4 2012-05-01 A
5 2013-03-01 B
6 2013-04-01 B
7 2013-05-01 B
8 2013-06-01 B
9 2013-07-01 B
10 2013-08-01 B
11 2013-09-01 B
12 2013-10-01 B
13 2013-11-01 B
14 2013-12-01 B
15 2014-01-01 B
16 2014-02-01 B
17 2014-03-01 B
18 2014-04-01 B
19 2014-05-01 B
20 2014-06-01 B
21 2014-07-01 B
22 2014-08-01 B
23 2014-09-01 B
24 2014-10-01 B
25 2014-11-01 B
26 2014-12-01 B
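If you also want the per-id counter from the desired output, one possible follow-up (my addition, assuming the result frame above with its id column) is cumcount:
result['counter'] = result.groupby('id').cumcount()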
You can adjust dates like this
In [17]: pd.Timestamp('2014-01-17')-pd.offsets.MonthBegin(1)
Out[17]: Timestamp('2014-01-01 00:00:00')
In [18]: pd.Timestamp('2014-01-31')-pd.offsets.MonthBegin(1)
Out[18]: Timestamp('2014-01-01 00:00:00')
In [19]: pd.Timestamp('2014-02-01')-pd.offsets.MonthBegin(1)
Out[19]: Timestamp('2014-01-01 00:00:00')
I need to calculate the count of events between 2015-01-01 and 2015-12-31 that happen each night between 21:30 and 04:30 the next day.
How can I do this with pandas in an elegant but simple and efficient way?
Example results table should look similar to the following:
count
2015-04-01 38 (events between 2015-03-31 21:30 and 2015-04-01 04:30)
2015-04-02 15 (events between 2015-04-01 21:30 and 2015-04-02 04:30)
2015-04-03 27 (events between 2015-04-02 21:30 and 2015-04-03 04:30)
Thanks for any help and suggestions.
You can use:
df = pd.DataFrame({'a':['2015-04-01 15:00','2015-04-01 23:00','2015-04-01 04:00','2015-04-02 03:00','2015-05-02 16:00','2015-04-03 02:00'],
'b':[2,4,3,1,7,10]})
df['a'] = pd.to_datetime(df.a)
print (df)
a b
0 2015-04-01 15:00:00 2
1 2015-04-01 23:00:00 4
2 2015-04-01 04:00:00 3
3 2015-04-02 03:00:00 1
4 2015-05-02 16:00:00 7
5 2015-04-03 02:00:00 10
Create DatetimeIndex:
start = pd.to_datetime('2015-04-01')
d = pd.date_range(start, periods=3)
print (d)
DatetimeIndex(['2015-04-01', '2015-04-02', '2015-04-03'], dtype='datetime64[ns]', freq='D')
Loop over the DatetimeIndex, select the matching rows by boolean indexing and get their length:
for dat in d:
    date_sum = len(df.loc[(df.a >= dat.date()+pd.offsets.DateOffset(hours=21, minutes=30)) &
                          (df.a <= dat.date()+pd.offsets.DateOffset(days=1, hours=4, minutes=30)),'b'])
    print (date_sum)
    print (dat.date())
2
2015-04-01
1
2015-04-02
0
2015-04-03
Create a new Series with a dict comprehension:
out = { dat.date(): len(df.loc[(df.a >= dat.date() + pd.offsets.DateOffset(hours=21, minutes=30)) & (df.a <= dat.date() + pd.offsets.DateOffset(days=1, hours=4, minutes=30)), 'b']) for dat in d}
s = pd.Series(out)
print (s)
2015-04-01 2
2015-04-02 1
2015-04-03 0
dtype: int64
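For larger data, a vectorized sketch of my own with between_time (reusing the 21:30/04:30 window and the index d from above) avoids the Python loop:
from datetime import time
night = df.set_index('a').between_time('21:30', '04:30')
# events after midnight belong to the previous evening
evening = night.index.normalize().where(night.index.time >= time(21, 30),
                                        night.index.normalize() - pd.Timedelta(days=1))
s = night.groupby(evening).size().reindex(d, fill_value=0)
print (s)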