I have the following dataframe with multi-index:
import pandas as pd
import numpy as np

dates = pd.date_range(start='2016-01-01 09:30:00', periods=20, freq='s')
df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.concatenate((dates, dates)),
                   'C': np.arange(40)})
df = df.set_index(["B", "A"])
Now I want to create a new column holding the maximum of the last two values for each value of index level A. I tried the following:
df.loc[:,"D"] = df.groupby(level="A").rolling(2).max()
But it only produces NaN for the new column ("D"), since the index of the grouped result does not align with the original dataframe's index.
How can I solve this? I would prefer to stay away from stacking/unstacking, swaplevel/sortlevel, join, or concat, since I have a big dataframe and these operations tend to be quite time-consuming.
You need reset_index with the drop parameter to remove the first level of the MultiIndex:
df['D'] = df.groupby(level="A")['C'].rolling(2).max().reset_index(level=0, drop=True)
print (df)
C D
B A
2016-01-01 09:30:00 1 0 NaN
2016-01-01 09:30:01 1 1 1.0
2016-01-01 09:30:02 1 2 2.0
2016-01-01 09:30:03 1 3 3.0
2016-01-01 09:30:04 1 4 4.0
2016-01-01 09:30:05 1 5 5.0
2016-01-01 09:30:06 1 6 6.0
2016-01-01 09:30:07 1 7 7.0
2016-01-01 09:30:08 1 8 8.0
2016-01-01 09:30:09 1 9 9.0
2016-01-01 09:30:10 1 10 10.0
2016-01-01 09:30:11 1 11 11.0
2016-01-01 09:30:12 1 12 12.0
2016-01-01 09:30:13 1 13 13.0
2016-01-01 09:30:14 1 14 14.0
2016-01-01 09:30:15 1 15 15.0
2016-01-01 09:30:16 1 16 16.0
2016-01-01 09:30:17 1 17 17.0
2016-01-01 09:30:18 1 18 18.0
2016-01-01 09:30:19 1 19 19.0
2016-01-01 09:30:00 2 20 NaN
2016-01-01 09:30:01 2 21 21.0
...
...
because:
print (df.groupby(level="A")['C'].rolling(2).max())
A B A
1 2016-01-01 09:30:00 1 NaN
2016-01-01 09:30:01 1 1.0
2016-01-01 09:30:02 1 2.0
2016-01-01 09:30:03 1 3.0
2016-01-01 09:30:04 1 4.0
2016-01-01 09:30:05 1 5.0
2016-01-01 09:30:06 1 6.0
2016-01-01 09:30:07 1 7.0
2016-01-01 09:30:08 1 8.0
2016-01-01 09:30:09 1 9.0
2016-01-01 09:30:10 1 10.0
2016-01-01 09:30:11 1 11.0
2016-01-01 09:30:12 1 12.0
2016-01-01 09:30:13 1 13.0
2016-01-01 09:30:14 1 14.0
2016-01-01 09:30:15 1 15.0
2016-01-01 09:30:16 1 16.0
2016-01-01 09:30:17 1 17.0
2016-01-01 09:30:18 1 18.0
2016-01-01 09:30:19 1 19.0
2 2016-01-01 09:30:00 2 NaN
2016-01-01 09:30:01 2 21.0
...
...
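Alternatively, assuming a pandas version that has Series.droplevel (0.24+), a minimal sketch that drops the prepended group-key level instead of using reset_index:
# droplevel(0) removes the extra "A" group-key level, so the result aligns with df's (B, A) index
df['D'] = df.groupby(level="A")['C'].rolling(2).max().droplevel(0)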
My data set is much larger so I have simplified it.
I want to convert the dataframe into a time-series.
The bit I am stuck on:
I have overlapping date ranges, where a smaller date range sits inside a larger one: the ranges in rows 1 and 2 fall inside the date range of row 0.
df:
date1 date2 reduction
0 2016-01-01 - 2016-01-05 7.0
1 2016-01-02 - 2016-01-03 5.0
2 2016-01-03 - 2016-01-04 6.0
3 2016-01-05 - 2016-01-12 10.0
How I want the output to look:
date1 date2 reduction
0 2016-01-01 2016-01-02 7.0
1 2016-01-02 2016-01-03 5.0
2 2016-01-03 2016-01-04 6.0
3 2016-01-04 2016-01-05 7.0
4 2016-01-05 2016-01-06 10.0
5 2016-01-06 2016-01-07 10.0
6 2016-01-07 2016-01-08 10.0
7 2016-01-08 2016-01-09 10.0
8 2016-01-09 2016-01-10 10.0
9 2016-01-10 2016-01-11 10.0
10 2016-01-11 2016-01-12 10.0
I prepared two columns of consecutive dates spanning the minimum and maximum dates, then updated the values from the original DataFrame:
import pandas as pd
import numpy as np
import io
data='''
date1 date2 reduction
0 2016-01-01 2016-01-05 7.0
1 2016-01-02 2016-01-03 5.0
2 2016-01-03 2016-01-04 6.0
3 2016-01-05 2016-01-12 10.0
'''
df = pd.read_csv(io.StringIO(data), sep=' ', index_col=0)
date_1 = pd.date_range(df.date1.min(), df.date2.max())
date_2 = pd.date_range(df.date1.min(), df.date2.max())
df2 = pd.DataFrame({'date1':date_1, 'date2':date_2, 'reduction':[0]*len(date_1)})
df2['date2'] = df2.date2.shift(-1)
df2.dropna(inplace=True)
for i in range(len(df)):
    mask = (df2.date1 >= df.date1.iloc[i]) & (df2.date2 <= df.date2.iloc[i])
    df2.loc[mask, 'reduction'] = df.reduction.iloc[i]
df2
date1 date2 reduction
0 2016-01-01 2016-01-02 7
1 2016-01-02 2016-01-03 5
2 2016-01-03 2016-01-04 6
3 2016-01-04 2016-01-05 7
4 2016-01-05 2016-01-06 10
5 2016-01-06 2016-01-07 10
6 2016-01-07 2016-01-08 10
7 2016-01-08 2016-01-09 10
8 2016-01-09 2016-01-10 10
9 2016-01-10 2016-01-11 10
10 2016-01-11 2016-01-12 10
I think this does what you want...
import pandas as pd
import datetime
first = {'date1': [datetime.date(2016,1,1), datetime.date(2016,1,2), datetime.date(2016,1,6), datetime.date(2016,1,7),
                   datetime.date(2016,1,8), datetime.date(2016,1,9), datetime.date(2016,1,10), datetime.date(2016,1,11)],
         'date2': [datetime.date(2016,1,5), datetime.date(2016,1,3), datetime.date(2016,1,7), datetime.date(2016,1,8),
                   datetime.date(2016,1,9), datetime.date(2016,1,10), datetime.date(2016,1,11), datetime.date(2016,1,12)],
         'reduction': [7, 5, 3, 2, 9, 3, 8, 3]}
df=pd.DataFrame.from_dict(first)
blank = pd.DataFrame(index=pd.date_range(df["date1"].min(), df["date2"].max()))
blank["r1"] = blank.join(df[["date1", "reduction"]].set_index("date1"), how="left")["reduction"]
blank["r2"] = blank.join(df[["date2", "reduction"]].set_index("date2"), how="left")["reduction"]
blank["r2"] = blank["r2"].shift(-1)
tmp = blank[pd.notnull(blank).any(axis=1)][pd.isnull(blank).any(axis=1)].reset_index().melt(id_vars=["index"])
tmp = tmp.sort_values(by="index").bfill()
blank1 = pd.DataFrame(index=pd.date_range(tmp["index"].min(), tmp["index"].max()))
tmp = blank1.join(tmp.set_index("index"), how="left").bfill().reset_index().groupby("index")["value"].first()
blank["r1"] = blank["r1"].combine_first(blank.join(tmp, how="left")["value"])
final = pd.DataFrame(data={"date1": blank.iloc[:-1, :].index, "date2": blank.iloc[1:, :].index, "reduction":blank["r1"].iloc[:-1].fillna(5).values})
Output:
date1 date2 reduction
0 2016-01-01 2016-01-02 7.0
1 2016-01-02 2016-01-03 5.0
2 2016-01-03 2016-01-04 7.0
3 2016-01-04 2016-01-05 7.0
4 2016-01-05 2016-01-06 5.0
5 2016-01-06 2016-01-07 3.0
6 2016-01-07 2016-01-08 2.0
7 2016-01-08 2016-01-09 9.0
8 2016-01-09 2016-01-10 3.0
9 2016-01-10 2016-01-11 8.0
10 2016-01-11 2016-01-12 3.0
Similar question to this one, but with some modifications:
Instead of filling in missing dates for each group between the min and max date of the entire column, we should only fill in the dates between the min and max of that group, and the output should keep only the last row for each date within each group.
Reproducible example:
x = pd.DataFrame({'dt': ['2016-01-01', '2016-01-03', '2016-01-04', '2016-01-01', '2016-01-01', '2016-01-04'],
                  'amount': [10.0, 30.0, 40.0, 78.0, 80.0, 82.0],
                  'sub_id': [1, 1, 1, 2, 2, 2]})
Visually:
dt sub_id amount
0 2016-01-01 1 10.0
1 2016-01-03 1 30.0
2 2016-01-04 1 40.0
3 2017-01-01 2 78.0
4 2017-01-01 2 80.0
5 2017-01-04 2 82.0
Output I need:
dt sub_id amount
0 2016-01-01 1 10.0
1 2016-01-02 1 10.0
2 2016-01-03 1 30.0
3 2016-01-04 1 40.0
4 2017-01-01 2 80.0
5 2017-01-02 2 80.0
6 2017-01-03 2 80.0
7 2017-01-04 2 82.0
We are grouping by dt and sub_id. As you can see, in sub_id=1, a row was added for 2016-01-02 and amount was imputed at 10.0, as the previous row was 10.0 (assume the data is sorted beforehand to enable this). For sub_id=2, rows were added for 2017-01-02 and 2017-01-03, and amount is 80.0, as that was the last value before those dates. The first row for 2017-01-01 was also deleted, because we just want to keep the last row for each date and sub_id.
Looking for the most efficient way to do this as the real data has millions of rows. I have a current method using lambda functions and applying them across groups of sub_id but I feel like we could do better.
Thanks!
Getting the date right of course:
x.dt = pd.to_datetime(x.dt)
Then this:
cols = ['dt', 'sub_id']
pd.concat([
d.asfreq('D').ffill(downcast='infer')
for _, d in x.drop_duplicates(cols, keep='last')
.set_index('dt').groupby('sub_id')
]).reset_index()
dt amount sub_id
0 2016-01-01 10 1
1 2016-01-02 10 1
2 2016-01-03 30 1
3 2016-01-04 40 1
4 2016-01-01 80 2
5 2016-01-02 80 2
6 2016-01-03 80 2
7 2016-01-04 82 2
By using resample with groupby
x.dt=pd.to_datetime(x.dt)
x.set_index('dt').groupby('sub_id').apply(lambda x : x.resample('D').max().ffill()).reset_index(level=1)
Out[265]:
dt amount sub_id
sub_id
1 2016-01-01 10.0 1.0
1 2016-01-02 10.0 1.0
1 2016-01-03 30.0 1.0
1 2016-01-04 40.0 1.0
2 2016-01-01 80.0 2.0
2 2016-01-02 80.0 2.0
2 2016-01-03 80.0 2.0
2 2016-01-04 82.0 2.0
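If the duplicated sub_id column in that output is unwanted, a hedged variant of the same idea (a sketch, not tested against the real data) selects the amount column before resampling, so sub_id only appears once after the reset_index:
(x.set_index('dt')
   .groupby('sub_id')['amount']
   .apply(lambda s: s.resample('D').max().ffill())  # daily max per group, then forward fill
   .reset_index())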
Use asfreq & groupby:
first convert dt to datetime & get rid of the duplicates,
then for each sub_id group use asfreq('D', method='ffill') to generate the missing dates and impute the amounts,
finally select the amount column and reset_index, since sub_id would otherwise appear both as a column and in the index.
x.dt = pd.to_datetime(x.dt)
x.drop_duplicates(
['dt', 'sub_id'], 'last'
).groupby('sub_id').apply(
lambda x: x.set_index('dt').asfreq('D', method='ffill')
).amount.reset_index()
# output:
sub_id dt amount
0 1 2016-01-01 10.0
1 1 2016-01-02 10.0
2 1 2016-01-03 30.0
3 1 2016-01-04 40.0
4 2 2016-01-01 80.0
5 2 2016-01-02 80.0
6 2 2016-01-03 80.0
7 2 2016-01-04 82.0
The below works for me and seems pretty efficient, but I can't say if it's efficient enough. It does avoid lambdas, though.
I called your data df.
Create a base_df with the entire date / sub_id grid:
import pandas as pd
from itertools import product
base_grid = product(pd.date_range(df['dt'].min(), df['dt'].max(), freq='D'),
                    list(range(df['sub_id'].min(), df['sub_id'].max() + 1, 1)))
base_df = pd.DataFrame(list(base_grid), columns=['dt', 'sub_id'])
Get the max value per dt / sub_id from df:
max_value_df = df.loc[df.groupby(['dt', 'sub_id'])['amount'].idxmax()]
max_value_df['dt'] = max_value_df['dt'].apply(pd.Timestamp)
Merge base_df on the max values:
merged_df = base_df.merge(max_value_df, how='left', on=['dt', 'sub_id'])
Sort and forward fill the maximal value:
merged_df = merged_df.sort_values(by=['sub_id', 'dt', 'amount'], ascending=True)
merged_df['amount'] = merged_df.groupby(['sub_id'])['amount'].fillna(method='ffill')
Result:
dt sub_id amount
0 2016-01-01 1 10.0
2 2016-01-02 1 10.0
4 2016-01-03 1 30.0
6 2016-01-04 1 40.0
1 2016-01-01 2 80.0
3 2016-01-02 2 80.0
5 2016-01-03 2 80.0
7 2016-01-04 2 82.0
I have two dataframes with different numbers of rows; one has daily values and the other has hourly values. I want to compare them and, where the dates match, add the daily value next to the hourly values of the same day. The dataframes are:
import pandas as pd
df1 = pd.read_csv(r'C:\Users\ABC.csv')
df2 = pd.read_csv(r'C:\Users\DEF.csv')
df1['Datetime'] = pd.to_datetime(df1['Datetime'])
df2['Datetime'] = pd.to_datetime(df2['Datetime'])
df1.head()
Out [3]
Datetime Value
0 2016-02-02 21:00:00 0.6
1 2016-02-02 22:00:00 0.4
2 2016-02-02 23:00:00 0.4
3 2016-03-02 00:00:00 0.3
4 2016-03-02 01:00:00 0.2
df2.head()
Out [4]
     Datetime  No of people
0 2016-02-02 56
1 2016-03-02 60
2 2016-04-02 91
3 2016-05-02 87
4 2016-06-02 90
What I would like to have is something like this:
Datetime Value No of People
0 2016-02-02 21:00:00 0.6 56
1 2016-02-02 22:00:00 0.4 56
2 2016-02-02 23:00:00 0.4 56
3 2016-03-02 00:00:00 0.3 60
4 2016-03-02 01:00:00 0.2 60
Any idea how to do this in Python using pandas? Please note there may be some dates missing.
You can set the index of df1 to df1.Datetime.dt.date and then join it with df2:
In [46]: df1.set_index(df1.Datetime.dt.date).join(df2.set_index('Datetime')).reset_index(drop=True)
Out[46]:
Datetime Value No_of_people
0 2016-02-02 21:00:00 0.6 56
1 2016-02-02 22:00:00 0.4 56
2 2016-02-02 23:00:00 0.4 56
3 2016-03-02 00:00:00 0.3 60
4 2016-03-02 01:00:00 0.2 60
Optionally, you may want to use the how='left' parameter when calling the join() function.
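As a hedged alternative sketch (assuming both Datetime columns were parsed with pd.to_datetime; daily and out are just illustrative local names), a merge on a normalized date key does the same thing and makes the left-join behaviour explicit:
daily = df2.rename(columns={'Datetime': 'date'})
out = (df1.assign(date=df1['Datetime'].dt.normalize())  # midnight timestamp = calendar-day key
          .merge(daily, on='date', how='left')
          .drop(columns='date'))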
You can just use pd.concat and fillna(method='ffill'), because the daily dates line up with the first (midnight) value of each day:
from datetime import date  # (pd and np as imported earlier)

df1 = pd.DataFrame(data={'day': np.random.randint(low=50, high=100, size=10),
                         'date': pd.date_range(date(2016, 1, 1), freq='D', periods=10)})
date day
0 2016-01-01 55
1 2016-01-02 51
2 2016-01-03 92
3 2016-01-04 78
4 2016-01-05 72
df2 = pd.DataFrame(data={'hour': np.random.randint(low=1, high=10, size=100),
                         'datetime': pd.date_range(date(2016, 1, 1), freq='H', periods=100)})
datetime hour
0 2016-01-01 00:00:00 5
1 2016-01-01 01:00:00 1
2 2016-01-01 02:00:00 4
3 2016-01-01 03:00:00 5
4 2016-01-01 04:00:00 2
like so:
pd.concat([df2.set_index('datetime'), df1.set_index('date')], axis=1).fillna(method='ffill')
to get:
hour day
2016-01-01 00:00:00 5.0 55.0
2016-01-01 01:00:00 1.0 55.0
2016-01-01 02:00:00 4.0 55.0
2016-01-01 03:00:00 5.0 55.0
2016-01-01 04:00:00 2.0 55.0
2016-01-01 05:00:00 3.0 55.0
2016-01-01 06:00:00 5.0 55.0
2016-01-01 07:00:00 6.0 55.0
2016-01-01 08:00:00 6.0 55.0
2016-01-01 09:00:00 8.0 55.0
2016-01-01 10:00:00 3.0 55.0
2016-01-01 11:00:00 5.0 55.0
2016-01-01 12:00:00 7.0 55.0
2016-01-01 13:00:00 7.0 55.0
2016-01-01 14:00:00 4.0 55.0
2016-01-01 15:00:00 5.0 55.0
2016-01-01 16:00:00 7.0 55.0
2016-01-01 17:00:00 4.0 55.0
2016-01-01 18:00:00 6.0 55.0
2016-01-01 19:00:00 1.0 55.0
2016-01-01 20:00:00 8.0 55.0
2016-01-01 21:00:00 8.0 55.0
2016-01-01 22:00:00 2.0 55.0
2016-01-01 23:00:00 3.0 55.0
2016-01-02 00:00:00 7.0 51.0
2016-01-02 01:00:00 6.0 51.0
2016-01-02 02:00:00 8.0 51.0
2016-01-02 03:00:00 6.0 51.0
2016-01-02 04:00:00 1.0 51.0
2016-01-02 05:00:00 5.0 51.0
... ... ...
2016-01-04 03:00:00 6.0 78.0
2016-01-04 04:00:00 9.0 78.0
2016-01-04 05:00:00 1.0 78.0
2016-01-04 06:00:00 6.0 78.0
2016-01-04 07:00:00 3.0 78.0
2016-01-04 08:00:00 9.0 78.0
2016-01-04 09:00:00 5.0 78.0
2016-01-04 10:00:00 3.0 78.0
2016-01-04 11:00:00 6.0 78.0
2016-01-04 12:00:00 4.0 78.0
2016-01-04 13:00:00 2.0 78.0
2016-01-04 14:00:00 4.0 78.0
2016-01-04 15:00:00 3.0 78.0
2016-01-04 16:00:00 4.0 78.0
2016-01-04 17:00:00 9.0 78.0
2016-01-04 18:00:00 8.0 78.0
2016-01-04 19:00:00 4.0 78.0
2016-01-04 20:00:00 7.0 78.0
2016-01-04 21:00:00 1.0 78.0
2016-01-04 22:00:00 6.0 78.0
2016-01-04 23:00:00 1.0 78.0
2016-01-05 00:00:00 5.0 72.0
2016-01-05 01:00:00 8.0 72.0
2016-01-05 02:00:00 6.0 72.0
2016-01-05 03:00:00 3.0 72.0
2016-01-06 00:00:00 3.0 87.0
2016-01-07 00:00:00 3.0 50.0
2016-01-08 00:00:00 3.0 65.0
2016-01-09 00:00:00 3.0 81.0
2016-01-10 00:00:00 3.0 65.0
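Note that ffill after concat relies purely on row order, so a day missing from the daily frame entirely would silently inherit the previous day's value. A hedged sketch that instead joins on the calendar day (assuming date and datetime are already datetime64; day_map is just an illustrative name):
day_map = df1.set_index('date')['day']
df2['day'] = df2['datetime'].dt.normalize().map(day_map)  # missing days become NaN instead of being filled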
My raw data looks like the following:
start_date end_date value
0 2016-01-01 2016-01-03 2
1 2016-01-05 2016-01-08 4
The interpretation is that the data takes a value of 2 between 1/1/2016 and 1/3/2016, and it takes a value of 4 between 1/5/2016 and 1/8/2016. I want to transform the raw data to a daily time series like the following:
2016-01-01 2
2016-01-02 2
2016-01-03 2
2016-01-04 0
2016-01-05 4
2016-01-06 4
2016-01-07 4
2016-01-08 4
Note that if a date in the time series doesn't appear between the start_date and end_date in any row of the raw data, it gets a value of 0 in the time series.
I can create the time series by looping through the raw data, but that's slow. Is there a faster way to do it?
You may try this:
In [120]: df
Out[120]:
start_date end_date value
0 2016-01-01 2016-01-03 2
1 2016-01-05 2016-01-08 4
In [121]: new = pd.DataFrame({'dt': pd.date_range(df.start_date.min(), df.end_date.max())})
In [122]: new
Out[122]:
dt
0 2016-01-01
1 2016-01-02
2 2016-01-03
3 2016-01-04
4 2016-01-05
5 2016-01-06
6 2016-01-07
7 2016-01-08
In [123]: new = new.merge(df, how='left', left_on='dt', right_on='start_date').fillna(method='pad')
In [124]: new
Out[124]:
dt start_date end_date value
0 2016-01-01 2016-01-01 2016-01-03 2.0
1 2016-01-02 2016-01-01 2016-01-03 2.0
2 2016-01-03 2016-01-01 2016-01-03 2.0
3 2016-01-04 2016-01-01 2016-01-03 2.0
4 2016-01-05 2016-01-05 2016-01-08 4.0
5 2016-01-06 2016-01-05 2016-01-08 4.0
6 2016-01-07 2016-01-05 2016-01-08 4.0
7 2016-01-08 2016-01-05 2016-01-08 4.0
In [125]: new.loc[(new.dt < new.start_date) | (new.dt > new.end_date), 'value'] = 0
In [126]: new[['dt', 'value']]
Out[126]:
dt value
0 2016-01-01 2.0
1 2016-01-02 2.0
2 2016-01-03 2.0
3 2016-01-04 0.0
4 2016-01-05 4.0
5 2016-01-06 4.0
6 2016-01-07 4.0
7 2016-01-08 4.0
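For comparison, a hedged sketch of a loop-free lookup using an IntervalIndex (assuming start_date/end_date are already datetime64 and the ranges do not overlap):
import numpy as np
idx = pd.IntervalIndex.from_arrays(df['start_date'], df['end_date'], closed='both')
days = pd.date_range(df['start_date'].min(), df['end_date'].max())
pos = idx.get_indexer(days)  # -1 where a day falls inside no interval
ts = pd.Series(np.where(pos >= 0, df['value'].values[pos], 0), index=days)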