pandas RollingGroupBy agg 'size' of rolling group (not 'count') - python

It is possible to perform a
df.groupby(...).rolling(...).agg({'any_df_col': 'count'})
But how about a 'size' agg?
'count' will produce a series with the 'running count' of rows that match the groupby condition (1, 1, 1, 2, 3...), but I would like to know, for all of those rows, the total number of rows that match the groupby (so 1, 1, 3, 3, 3 in that case).
Usually in pandas I think this is achieved by using size instead of count.
This code may illustrate:
import datetime as dt
import pandas as pd
df = pd.DataFrame({'time_ref': [
dt.datetime(2023, 1, 1, 0, 30),
dt.datetime(2023, 1, 1, 0, 30),
dt.datetime(2023, 1, 1, 1),
dt.datetime(2023, 1, 1, 2),
dt.datetime(2023, 1, 1, 2, 15),
dt.datetime(2023, 1, 1, 2, 16),
dt.datetime(2023, 1, 1, 4),
],
'value': [1, 2, 1, 10, 10, 10, 10],
'type': [0, 0, 0, 0, 0, 0, 0]
})
df = df.set_index(pd.DatetimeIndex(df['time_ref']), drop=True)
by = ['value']
window = '1H'
gb_rolling = df.groupby(by=by).rolling(window=window)
agg_d = {'type': 'count'}
test = gb_rolling.agg(agg_d)
print (test)
# this works
type
value time_ref
1 2023-01-01 00:30:00 1.0
2023-01-01 01:00:00 2.0
2 2023-01-01 00:30:00 1.0
10 2023-01-01 02:00:00 1.0
2023-01-01 02:15:00 2.0
2023-01-01 02:16:00 3.0
2023-01-01 04:00:00 1.0
# but this doesn't
agg_d = {'type': 'size'}
test = gb_rolling.agg(agg_d)
# AttributeError: 'size' is not a valid function for 'RollingGroupby' object
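For comparison, a plain (non-rolling) groupby does accept 'size'; a minimal sketch with the df above, just to show the count/size distinction (it ignores the 1-hour event window, which is the whole problem):
# non-rolling: 'size' counts rows per group, 'count' counts non-null values
print(df.groupby(by=by).size())           # value 1 -> 2, value 2 -> 1, value 10 -> 4
print(df.groupby(by=by)['type'].count())  # identical here, since 'type' has no NaNs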
my desired output is to get the SIZE of the group ... this:
type
value time_ref
1 2023-01-01 00:30:00 2
2023-01-01 01:00:00 2
2 2023-01-01 00:30:00 1
10 2023-01-01 02:00:00 3
2023-01-01 02:15:00 3
2023-01-01 02:16:00 3
2023-01-01 04:00:00 1
I cannot think of a way to do what I need without using the rolling functionality, because the relevant windows of my data are not determined by calendar time but by the time of the events themselves. If that assumption is wrong, and I can get a 'size' without using rolling, that is OK, but as far as I know I have to use rolling, since the time_ref of the event is what matters for grouping with subsequent rows, not pure calendar time.
Thanks.

I'm not completely following your question. It seems like you want the type column to be the number of rows of a given value for each 1-hour increment... But if that's the case your desired output is incorrect, and should be:
value time_ref type
1 2023-01-01 00:30:00 1 # <- not 2 here (1 in 0-hr, 1 in 1-hr window)
2023-01-01 01:00:00 1 # <- same here
2 2023-01-01 00:30:00 1 # rest is ok....
...
If that's correct, then, starting with:
df = pd.DataFrame({
'time_ref': [
dt.datetime(2023, 1, 1, 0, 30),
dt.datetime(2023, 1, 1, 0, 30),
dt.datetime(2023, 1, 1, 1),
dt.datetime(2023, 1, 1, 2),
dt.datetime(2023, 1, 1, 2, 15),
dt.datetime(2023, 1, 1, 2, 16),
dt.datetime(2023, 1, 1, 4)],
'value': [1, 2, 1, 10, 10, 10, 10]})
...just add an hour column:
df['hour'] = df.time_ref.dt.hour
and aggregate on that and value:
tmp = (
df.groupby(['value', 'hour'])
.agg('count')
.reset_index()
.rename(columns={'time_ref': 'type'}))
which gives you:
value hour type
0 1 0 1
1 1 1 1
2 2 0 1
3 10 2 3
4 10 4 1
...which you can join back onto your original df:
res = df.merge(tmp, how='left', on=['value', 'hour'])
time_ref value hour type
0 2023-01-01 00:30:00 1 0 1
1 2023-01-01 00:30:00 2 0 1
2 2023-01-01 01:00:00 1 1 1
3 2023-01-01 02:00:00 10 2 3
4 2023-01-01 02:15:00 10 2 3
5 2023-01-01 02:16:00 10 2 3
6 2023-01-01 04:00:00 10 4 1
If that's not what you're looking for, you may clarify your question.
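As a side note, the merge can probably be avoided with a grouped transform; a sketch using the same calendar-hour grouping as above:
# broadcast the per-(value, hour) row count back onto every row
df['type'] = df.groupby(['value', df.time_ref.dt.hour])['time_ref'].transform('count')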

Ah.. thanks for clarifying. I understand the problem now.
I played around with rolling, but couldn't find a way to get it to work either... but here is an alternate method:
df = pd.DataFrame({
'time_ref': [
dt.datetime(2023, 1, 1, 0, 30),
dt.datetime(2023, 1, 1, 0, 30),
dt.datetime(2023, 1, 1, 1),
dt.datetime(2023, 1, 1, 2),
dt.datetime(2023, 1, 1, 2, 15),
dt.datetime(2023, 1, 1, 2, 16),
dt.datetime(2023, 1, 1, 4)],
'value': [1, 2, 1, 10, 10, 10, 10]})
df.index = df.time_ref
value_start = df.groupby('value').agg(min)
df['hrs_since_group_start'] = df.apply(
lambda row: row.time_ref - value_start.loc[row.value, 'time_ref'],
axis=1
).view(int) / 1_000_000_000 / 60 // 60
(.view(int) changes the timedelta to nanoseconds, so the / 1_000_000_000 / 60 changes it to minutes since the start of the group, and // 60 changes that to the number of whole hours since the group started.)
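If the .view(int) cast feels opaque (Series.view is deprecated in recent pandas), an equivalent sketch using .dt.total_seconds() and a grouped transform instead of the row-wise apply:
df['hrs_since_group_start'] = (
    df['time_ref'].sub(df.groupby('value')['time_ref'].transform('min'))
    .dt.total_seconds() // 3600
)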
group_hourly_counts = (
df.groupby(['value', 'hrs_since_group_start'])
.agg('count')
.reset_index()
.rename(columns={'time_ref': 'type'}))
res = (
df.merge(
group_hourly_counts,
how='left',
on=['value', 'hrs_since_group_start'])
.drop(columns='hrs_since_group_start'))
res:
time_ref value type
0 2023-01-01 00:30:00 1 2
1 2023-01-01 00:30:00 2 1
2 2023-01-01 01:00:00 1 2
3 2023-01-01 02:00:00 10 3
4 2023-01-01 02:15:00 10 3
5 2023-01-01 02:16:00 10 3
6 2023-01-01 04:00:00 10 1
...somebody more familiar with the rolling functionality can probably find you a simpler solution though :)

If .rolling in combination with count doesn't give you what you need, then I don't think this is really a "rolling" problem. You could try the following (I think it's similar to Damian's second answer):
df = df.assign(
hours=df["time_ref"].sub(df.groupby("value")["time_ref"].transform("first"))
.dt.seconds.floordiv(3_600),
type=lambda df: df.groupby(["value", "hours"]).transform("size")
).drop(columns="hours").set_index(["value", "time_ref"]).sort_index()
Result for the sample:
type
value time_ref
1 2023-01-01 00:30:00 2
2023-01-01 01:00:00 2
2 2023-01-01 00:30:00 1
10 2023-01-01 02:00:00 3
2023-01-01 02:15:00 3
2023-01-01 02:16:00 3
2023-01-01 04:00:00 1

Related

Convert 2D dataframe to 3D numpy array based on unique ID

I have a dataframe in this format:
time column ID column Value
2022-01-01 00:00:00 1 10
2022-01-01 00:15:00 1 0
2022-01-01 00:30:00 1 9
2022-01-01 00:45:00 1 0
2022-01-02 00:00:00 1 0
2022-01-02 00:15:00 1 0
2022-01-02 00:30:00 1 5
2022-01-02 00:45:00 1 15
2022-01-01 00:00:00 2 6
2022-01-01 00:15:00 2 2
2022-01-01 00:30:00 2 0
2022-01-01 00:45:00 2 0
2022-01-02 00:00:00 2 0
2022-01-02 00:15:00 2 0
2022-01-02 00:30:00 2 0
2022-01-02 00:45:00 2 7
... though my dataframe is much larger, with more than 500 IDs.
I want to convert this 2D dataframe into a 3D array in the format (num_time_samples, value, ID). Essentially I would like to have one 2D array for every unique ID.
I plan on using the value column to build lag-based feature vectors, but I'm stuck on how to convert the dataframe. I've searched and tried df.values, reshaping, etc., and nothing has worked.
Say you have
df = pd.DataFrame(
{
'time column': [
'00:00:00', '00:15:00', '00:00:00', '00:15:00',
],
'ID column': [
1, 1, 2, 2,
],
'Value': [
10, 0, 6, 2,
],
}
)
where df actually is a subset of your dataframe, keeping everything data-type-naive.
I want to convert this 2D - dataframe into a 3D array in this format (num_time_samples, value, ID).
Why not do
a = (
df
.set_index(['time column', 'ID column'])
.unstack(level=-1) # which leaves 'time column' as first dimension index
.to_numpy()
.reshape(
(
df['time column'].unique().size,
df['ID column'].unique().size,
1,
)
)
)
a looks like
>>> a
array([[[10],
[ 6]],
[[ 0],
[ 2]]], dtype=int64)
>>> a.shape
(2, 2, 1)
>>> a.ndim
3
a is structured as time column × ID column × Value (and indexable accordingly). E.g. let's get individuals' 00:15:00-data
>>> a[1] # <=> a[1, ...] <=> a[1, :, :]
array([[0],
[2]], dtype=int64)
Let's get the first and second individual's time series, respectively,
>>> a[:, 0] # <=> a[:, 0, :] <=> a[..., 0, :]
array([[10],
[ 0]], dtype=int64)
>>> a[:, 1]
array([[6],
[2]], dtype=int64)
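For what it's worth, an alternative sketch that skips the reshape arithmetic, assuming every ID has the same set of timestamps: np.stack builds the last axis from one 2-D block per ID, which gives the (num_time_samples, value, ID) layout asked for in the question.
import numpy as np
# one (num_time_samples, 1) block per ID, stacked along a new last axis
a_alt = np.stack(
    [g.sort_values('time column')['Value'].to_numpy().reshape(-1, 1)
     for _, g in df.groupby('ID column')],
    axis=-1,
)
# a_alt.shape == (num_time_samples, 1, num_IDs)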

Is there a way to know how many items are sold until the end of the day after being considered a hit?

Let's imagine that we have this dataset.
import pandas as pd
import numpy as np
# create list
data = [['10/1/2019 08:12:09', np.nan, 0, 54], ['10/1/2019 09:12:09', '10/1/2019 08:52:09', 1, 54], ['10/1/2019 10:30:19','10/1/2019 10:10:09', 1, 3],
['10/1/2019 13:07:19', '10/1/2019 12:52:09', 1, 12], ['10/1/2019 13:25:09', np.nan, 0, 3],
['10/1/2019 17:52:09', np.nan, 0, 54], ['10/1/2019 18:21:09', np.nan, 0, 12],
['10/2/2019 10:52:09', np.nan, 0, 54], ['10/2/2019 12:59:19','10/2/2019 12:57:09', 1, 12],
['10/2/2019 13:52:19', '10/2/2019 13:39:09', 1, 54], ['10/2/2019 19:52:09', np.nan, 0, 12],
['10/2/2019 20:52:09', np.nan, 0, 54], ['10/2/2019 20:57:09', np.nan, 0, 12]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['first_timestamp', 'second_timestamp', 'hit', 'item'])
# print the dataframe
df
first_timestamp second_timestamp hit item
0 10/1/2019 08:12:09 NaN 0 54
1 10/1/2019 09:12:09 10/1/2019 08:52:09 1 54
2 10/1/2019 10:30:19 10/1/2019 10:10:09 1 3
3 10/1/2019 13:07:19 10/1/2019 12:52:09 1 12
4 10/1/2019 13:25:09 NaN 0 3
5 10/1/2019 17:52:09 NaN 0 54
6 10/1/2019 18:21:09 NaN 0 12
7 10/2/2019 10:52:09 NaN 0 54
8 10/2/2019 12:59:19 10/2/2019 12:57:09 1 12
9 10/2/2019 13:52:19 10/2/2019 13:39:09 1 54
10 10/2/2019 19:52:09 NaN 0 12
11 10/2/2019 20:52:09 NaN 0 54
12 10/2/2019 20:57:09 NaN 0 12
When the second_timestamp column is missing, the hit column has a value of 0; when both timestamp columns have a value, hit is 1. My goal is to know, for each of the items I have (3, 12 and 54), how many were sold until the end of the respective day after (only after, not before) a hit equal to 1 happened.
day        item  items_sold
10/1/2019  3     1
           12    1
           54    1
10/2/2019  3     0
           12    2
           54    1
IIUC, for your data, we can do this:
# it's good practice to have time as datetime type
# skip if already is
df.first_timestamp = pd.to_datetime(df.first_timestamp)
df.second_timestamp = pd.to_datetime(df.second_timestamp)
# dates
df['date'] = df.first_timestamp.dt.normalize()
# s counts the number of hits so far
(df.assign(s=df.groupby(['date', 'item'])['hit'].cumsum())
.query('hit != 1 & s>0') # after the first hit
.groupby('date') # groupby date
['item'].value_counts() # counts
)
Output:
date item
2019-10-01 3 1
12 1
54 1
2019-10-02 12 2
54 1
Name: item, dtype: int64
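One possible follow-up, since the desired output also lists items with zero post-hit sales on a given day (e.g. item 3 on 10/2): reindexing the counts against every (date, item) pair fills those in. A sketch:
counts = (df.assign(s=df.groupby(['date', 'item'])['hit'].cumsum())
            .query('hit != 1 & s > 0')
            .groupby('date')['item'].value_counts())
# add the missing (date, item) combinations with a count of 0
full_idx = pd.MultiIndex.from_product([df['date'].unique(), df['item'].unique()],
                                      names=['date', 'item'])
counts = counts.reindex(full_idx, fill_value=0).sort_index()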

Pandas groupby with identification of an element with max value in another column

I have a dataframe with sales results of items with different pricing rules:
import pandas as pd
from datetime import timedelta
df_1 = pd.DataFrame()
df_2 = pd.DataFrame()
df_3 = pd.DataFrame()
# Create datetimes and data
df_1['item'] = [1, 1, 2, 2, 2]
df_1['date'] = pd.date_range('1/1/2018', periods=5, freq='D')
df_1['price_rule'] = ['a', 'b', 'a', 'b', 'b']
df_1['sales']= [2, 4, 1, 5, 7]
df_1['clicks']= [7, 8, 9, 10, 11]
df_2['item'] = [1, 1, 2, 2, 2]
df_2['date'] = pd.date_range('1/1/2018', periods=5, freq='D')
df_2['price_rule'] = ['b', 'b', 'a', 'a', 'a']
df_2['sales']= [2, 3, 4, 5, 6]
df_2['clicks']= [7, 8, 9, 10, 11]
df_3['item'] = [1, 1, 2, 2, 2]
df_3['date'] = pd.date_range('1/1/2018', periods=5, freq='D')
df_3['price_rule'] = ['b', 'a', 'b', 'a', 'b']
df_3['sales']= [6, 5, 4, 5, 6]
df_3['clicks']= [7, 8, 9, 10, 11]
df = pd.concat([df_1, df_2, df_3])
df = df.sort_values(['item', 'date'])
df.reset_index(drop=True)
df
It results with:
item date price_rule sales clicks
0 1 2018-01-01 a 2 7
0 1 2018-01-01 b 2 7
0 1 2018-01-01 b 6 7
1 1 2018-01-02 b 4 8
1 1 2018-01-02 b 3 8
1 1 2018-01-02 a 5 8
2 2 2018-01-03 a 1 9
2 2 2018-01-03 a 4 9
2 2 2018-01-03 b 4 9
3 2 2018-01-04 b 5 10
3 2 2018-01-04 a 5 10
3 2 2018-01-04 a 5 10
4 2 2018-01-05 b 7 11
4 2 2018-01-05 a 6 11
4 2 2018-01-05 b 6 11
My goal is to:
1. group all items by day (to get a single row for each item and given day)
2. aggregate 'clicks' with "sum"
3. generate a "winning_pricing_rule" column as follows:
- for a given item and given date, take the pricing rule with the highest 'sales' value
- in case of a draw (see e.g. item 2 on 2018-01-03 in the sample above): choose just one of them (that's rare in my dataset, so it can be random...)
I imagine the result to look like this:
item date winning_price_rule clicks
0 1 2018-01-01 b 21
1 1 2018-01-02 a 24
2 2 2018-01-03 b 27 <<remark: could also be a (due to draw)
3 2 2018-01-04 a 30 <<remark: could also be b (due to draw)
4 2 2018-01-05 b 33
I tried:
a.groupby(['item', 'date'], as_index = False).agg({'sales':'sum','revenue':'max'})
but failed to identify a winning pricing rule.
Any ideas? Many Thanks for help :)
Andy
First convert the price_rule column to the index with DataFrame.set_index, so that the winning rule can be obtained with DataFrameGroupBy.idxmax (the index value at the maximum of sales) inside GroupBy.agg, which also lets you aggregate clicks with sum:
df1 = (df.set_index('price_rule')
.groupby(['item', 'date'])
.agg({'sales':'idxmax', 'clicks':'sum'})
.reset_index())
For pandas 0.25+ it is possible to use named aggregation:
df1 = (df.set_index('price_rule')
         .groupby(['item', 'date'])
         .agg(winning_pricing_rule=pd.NamedAgg(column='sales', aggfunc='idxmax'),
              clicks=pd.NamedAgg(column='clicks', aggfunc='sum'))
         .reset_index())
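An alternative sketch without idxmax: sort so the highest-sales row comes first within each (item, date) group, then keep that row's price_rule and the summed clicks (named aggregation, pandas 0.25+):
df1 = (df.sort_values(['item', 'date', 'sales'], ascending=[True, True, False])
         .groupby(['item', 'date'], as_index=False)
         .agg(winning_price_rule=('price_rule', 'first'),
              clicks=('clicks', 'sum')))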

Pandas .loc or .iloc to select the columns from a dataset

I have been trying to select a particular set of columns from a dataset for all the rows. I tried something like below.
train_features = train_df.loc[,[0,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18]]
I want to mention that all rows are inclusive, but I only need the numbered columns.
Is there any better way to approach this?
sample data:
age job marital education default housing loan equities contact duration campaign pdays previous poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed y
56 housemaid married basic.4y 1 1 1 1 0 261 1 999 0 2 1.1 93.994 -36.4 3.299552287 5191 1
37 services married high.school 1 0 1 1 0 226 1 999 0 2 1.1 93.994 -36.4 0.743751247 5191 1
56 services married high.school 1 1 0 1 0 307 1 999 0 2 1.1 93.994 -36.4 1.28265179 5191 1
I'm trying to exclude the job, marital, education and y columns from my dataset. The y column is the target variable.
If you need to select by position, use iloc:
train_features = train_df.iloc[:, [0,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18]]
print (train_features)
age default housing loan equities contact duration campaign pdays \
0 56 1 1 1 1 0 261 1 999
1 37 1 0 1 1 0 226 1 999
2 56 1 1 0 1 0 307 1 999
previous poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m \
0 0 2 1.1 93.994 -36.4 3.299552
1 0 2 1.1 93.994 -36.4 0.743751
2 0 2 1.1 93.994 -36.4 1.282652
nr.employed
0 5191
1 5191
2 5191
Another solution is to drop the unnecessary columns:
cols= ['job','marital','education','y']
train_features = train_df.drop(cols, axis=1)
print (train_features)
age default housing loan equities contact duration campaign pdays \
0 56 1 1 1 1 0 261 1 999
1 37 1 0 1 1 0 226 1 999
2 56 1 1 0 1 0 307 1 999
previous poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m \
0 0 2 1.1 93.994 -36.4 3.299552
1 0 2 1.1 93.994 -36.4 0.743751
2 0 2 1.1 93.994 -36.4 1.282652
nr.employed
0 5191
1 5191
2 5191
You can access the column values via the underlying numpy array.
Consider the dataframe df
df = pd.DataFrame(np.random.randint(10, size=(5, 20)))
df
You can slice the underlying array
slc = [0,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18]
df.values[:, slc]
array([[1, 3, 9, 8, 3, 2, 1, 6, 6, 0, 3, 9, 8, 5, 9, 9],
[8, 0, 2, 3, 7, 8, 9, 2, 7, 2, 1, 3, 2, 5, 4, 9],
[1, 1, 9, 3, 5, 8, 8, 8, 8, 4, 8, 0, 5, 4, 9, 0],
[6, 3, 1, 8, 0, 3, 7, 9, 9, 0, 9, 7, 6, 1, 4, 8],
[3, 2, 3, 3, 9, 8, 3, 8, 3, 4, 1, 6, 4, 1, 6, 4]])
Or you can reconstruct a new dataframe from this slice
pd.DataFrame(df.values[:, slc], df.index, df.columns[slc])
This is not as clean and intuitive as
df.iloc[:, slc]
You could also use slc to slice the df.columns object and pass that to df.loc
df.loc[:, df.columns[slc]]
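If you'd prefer to avoid positional indices altogether, a sketch that keeps every column except the named ones via a boolean mask (equivalent to the drop approach above):
train_features = train_df.loc[:, ~train_df.columns.isin(['job', 'marital', 'education', 'y'])]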

Insert 0-values for missing dates within MultiIndex

Let's assume I have a MultiIndex which consists of the date and some categories (one for simplicity in the example below) and for each category I have a time series with values of some process.
I only have a value when there was an observation and I now want to add a "0" whenever there was no observation on that date.
I found a way which seems very inefficient (stacking and unstacking, which will create many, many columns in the case of millions of categories).
import datetime as dt
import pandas as pd
days = 4
# List of all dates that should be in the index
all_dates = [dt.date(2013, 2, 13) - dt.timedelta(days=x)
             for x in range(days)]
df = pd.DataFrame([
    (dt.date(2013, 2, 10), 1, 4),
    (dt.date(2013, 2, 10), 2, 7),
    (dt.date(2013, 2, 11), 2, 7),
    (dt.date(2013, 2, 13), 1, 2),
    (dt.date(2013, 2, 13), 2, 3)],
    columns=['date', 'category', 'value'])
df.set_index(['date', 'category'], inplace=True)
print(df)
# insert 0 values for missing dates
print(df.unstack().reindex(all_dates).fillna(0).stack())
print(all_dates)
value
date category
2013-02-10 1 4
2 7
2013-02-11 2 7
2013-02-13 1 2
2 3
value
category
2013-02-13 1 2
2 3
2013-02-12 1 0
2 0
2013-02-11 1 0
2 7
2013-02-10 1 4
2 7
[datetime.date(2013, 2, 13), datetime.date(2013, 2, 12),
datetime.date(2013, 2, 11), datetime.date(2013, 2, 10)]
Does anybody know a smarter way to achieve the same?
EDIT: I found another possibility to achieve the same:
import datetime as dt
import pandas as pd
days = 4
# List of all dates that should be in the index
all_dates = [dt.date(2013, 2, 13) - dt.timedelta(days=x) for x in range(days)]
df = pd.DataFrame([(dt.date(2013, 2, 10), 1, 4, 5),
                   (dt.date(2013, 2, 10), 2, 1, 7),
                   (dt.date(2013, 2, 10), 2, 2, 7),
                   (dt.date(2013, 2, 11), 2, 3, 7),
                   (dt.date(2013, 2, 13), 1, 4, 2),
                   (dt.date(2013, 2, 13), 2, 4, 3)],
                  columns=['date', 'category', 'cat2', 'value'])
columns = ['date', 'category', 'cat2', 'value'])
date_col = 'date'
other_index = ['category', 'cat2']
index = [date_col] + other_index
df.set_index(index, inplace=True)
grouped = df.groupby(level=other_index)
df_list = []
for i, group in grouped:
    df_list.append(
        group.reset_index(level=other_index).reindex(all_dates).fillna(0))
print(pd.concat(df_list).set_index(other_index, append=True))
value
category cat2
2013-02-13 1 4 2
2013-02-12 0 0 0
2013-02-11 0 0 0
2013-02-10 1 4 5
2013-02-13 0 0 0
2013-02-12 0 0 0
2013-02-11 0 0 0
2013-02-10 2 1 7
2013-02-13 0 0 0
2013-02-12 0 0 0
2013-02-11 0 0 0
2013-02-10 2 2 7
2013-02-13 0 0 0
2013-02-12 0 0 0
2013-02-11 2 3 7
2013-02-10 0 0 0
2013-02-13 2 4 3
2013-02-12 0 0 0
2013-02-11 0 0 0
2013-02-10 0 0 0
You can make a new multi index based on the Cartesian product of the index levels you want. Then, re-index your data frame using the new index.
(date_index, category_index) = df.index.levels
new_index = pd.MultiIndex.from_product([all_dates, category_index])
new_df = df.reindex(new_index)
# Optional: convert missing values to zero, and convert the data back
# to integers. See explanation below.
new_df = new_df.fillna(0).astype(int)
That's it! The new data frame has all the possible index values. The existing data is indexed correctly.
Read on for a more detailed explanation.
Explanation
Set up sample data
import datetime as dt
import pandas as pd
days= 4
#List of all dates that should be in the index
all_dates = [dt.date(2013, 2, 13) - dt.timedelta(days=x)
for x in range(days)]
df = pd.DataFrame([
(dt.date(2013, 2, 10), 1, 4),
(dt.date(2013, 2, 10), 2, 7),
(dt.date(2013, 2, 11), 2, 7),
(dt.date(2013, 2, 13), 1, 2),
(dt.date(2013, 2, 13), 2, 3)],
columns = ['date', 'category', 'value'])
df.set_index(['date', 'category'], inplace=True)
Here's what the sample data looks like
value
date category
2013-02-10 1 4
2 7
2013-02-11 2 7
2013-02-13 1 2
2 3
Make new index
Using from_product we can make a new multi index. This new index is the Cartesian product of all the values you pass to the function.
(date_index, category_index) = df.index.levels
new_index = pd.MultiIndex.from_product([all_dates, category_index])
Reindex
Use the new index to reindex the existing data frame.
All the possible combinations are now present. The missing values are null (NaN).
new_df = df.reindex(new_index)
Now, the expanded, re-indexed data frame looks like this:
value
2013-02-13 1 2.0
2 3.0
2013-02-12 1 NaN
2 NaN
2013-02-11 1 NaN
2 7.0
2013-02-10 1 4.0
2 7.0
Nulls in integer column
You can see that the data in the new data frame has been converted from ints to floats. Pandas can't have nulls in a regular int64 column. Optionally, we can convert all the nulls to 0, and cast the data back to integers.
new_df = new_df.fillna(0).astype(int)
Result
value
2013-02-13 1 2
2 3
2013-02-12 1 0
2 0
2013-02-11 1 0
2 7
2013-02-10 1 4
2 7
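As an aside, on recent pandas versions the nullable 'Int64' dtype can keep the column as integers without filling anything in; a sketch, assuming you'd rather keep the gaps as missing than turn them into 0:
# keeps <NA> for the missing combinations instead of 0
new_df = df.reindex(new_index).astype('Int64')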
Check out this answer: How to fill the missing record of Pandas dataframe in pythonic way?
You can do something like:
import datetime
import pandas as pd
# make an empty dataframe with the index you want
def get_datetime(x):
    return datetime.date(2013, 2, 13) - datetime.timedelta(days=x)

all_dates = [get_datetime(x) for x in range(4)]
categories = [1, 2, 3, 4]
index = [[date, cat] for cat in categories for date in all_dates]

# this df will be just an index
df = pd.DataFrame(index, columns=['date', 'category'])
df = df.set_index(['date', 'category'])

# now if your original df is called df_orig you can reindex against the new index
df_orig = df_orig.reindex(df.index)
# and to add zeros
df_orig = df_orig.fillna(0)
