My data set is much larger so I have simplified it.
I want to convert the dataframe into a time-series.
The bit I am stuck on:
I have overlapping date ranges, where a smaller date range sits inside a larger one: rows 1 and 2 fall inside the date range of row 0.
df:
date1 date2 reduction
0 2016-01-01 - 2016-01-05 7.0
1 2016-01-02 - 2016-01-03 5.0
2 2016-01-03 - 2016-01-04 6.0
3 2016-01-05 - 2016-01-12 10.0
How I want the output to look:
date1 date2 reduction
0 2016-01-01 2016-01-02 7.0
1 2016-01-02 2016-01-03 5.0
2 2016-01-03 2016-01-04 6.0
3 2016-01-04 2016-01-05 7.0
4 2016-01-05 2016-01-06 10.0
5 2016-01-06 2016-01-07 10.0
6 2016-01-07 2016-01-08 10.0
7 2016-01-08 2016-01-09 10.0
8 2016-01-09 2016-01-10 10.0
9 2016-01-10 2016-01-11 10.0
10 2016-01-11 2016-01-12 10.0
I built two columns of consecutive dates spanning the minimum and maximum dates, then updated the values from the original df in a loop.
import pandas as pd
import numpy as np
import io

data = '''
date1 date2 reduction
0 2016-01-01 2016-01-05 7.0
1 2016-01-02 2016-01-03 5.0
2 2016-01-03 2016-01-04 6.0
3 2016-01-05 2016-01-12 10.0
'''
df = pd.read_csv(io.StringIO(data), sep=r'\s+', index_col=0, parse_dates=['date1', 'date2'])

# one row per day across the full span; date2 is the following day
date_1 = pd.date_range(df.date1.min(), df.date2.max())
date_2 = pd.date_range(df.date1.min(), df.date2.max())
df2 = pd.DataFrame({'date1': date_1, 'date2': date_2, 'reduction': [0] * len(date_1)})
df2['date2'] = df2.date2.shift(-1)
df2.dropna(inplace=True)

# overwrite each daily row with the reduction of any range that contains it;
# inner ranges (later rows) overwrite the outer range they sit in
for i in range(len(df)):
    mask = (df2.date1 >= df.date1.iloc[i]) & (df2.date2 <= df.date2.iloc[i])
    df2.loc[mask, 'reduction'] = df.reduction.iloc[i]

df2
date1 date2 reduction
0 2016-01-01 2016-01-02 7
1 2016-01-02 2016-01-03 5
2 2016-01-03 2016-01-04 6
3 2016-01-04 2016-01-05 7
4 2016-01-05 2016-01-06 10
5 2016-01-06 2016-01-07 10
6 2016-01-07 2016-01-08 10
7 2016-01-08 2016-01-09 10
8 2016-01-09 2016-01-10 10
9 2016-01-10 2016-01-11 10
10 2016-01-11 2016-01-12 10
I think this does what you want (note that I used my own sample data with more rows, so the numbers differ from your example)...
import pandas as pd
import datetime
first = {'date1': [datetime.date(2016,1,1), datetime.date(2016,1,2), datetime.date(2016,1,6), datetime.date(2016,1,7),
                   datetime.date(2016,1,8), datetime.date(2016,1,9), datetime.date(2016,1,10), datetime.date(2016,1,11)],
         'date2': [datetime.date(2016,1,5), datetime.date(2016,1,3), datetime.date(2016,1,7), datetime.date(2016,1,8),
                   datetime.date(2016,1,9), datetime.date(2016,1,10), datetime.date(2016,1,11), datetime.date(2016,1,12)],
         'reduction': [7, 5, 3, 2, 9, 3, 8, 3]}
df = pd.DataFrame.from_dict(first)
# daily scaffold covering the full span
blank = pd.DataFrame(index=pd.date_range(df["date1"].min(), df["date2"].max()))
# reduction keyed by each range's start date (r1) and, shifted back one day, by its end date (r2)
blank["r1"] = blank.join(df[["date1", "reduction"]].set_index("date1"), how="left")["reduction"]
blank["r2"] = blank.join(df[["date2", "reduction"]].set_index("date2"), how="left")["reduction"]
blank["r2"] = blank["r2"].shift(-1)
# take the days where only one of r1/r2 is known and back-fill across the gaps
tmp = blank[pd.notnull(blank).any(axis=1)][pd.isnull(blank).any(axis=1)].reset_index().melt(id_vars=["index"])
tmp = tmp.sort_values(by="index").bfill()
blank1 = pd.DataFrame(index=pd.date_range(tmp["index"].min(), tmp["index"].max()))
tmp = blank1.join(tmp.set_index("index"), how="left").bfill().reset_index().groupby("index")["value"].first()
# fill the remaining gaps in r1 from those back-filled values
blank["r1"] = blank["r1"].combine_first(blank.join(tmp, how="left")["value"])
# pair consecutive days into date1/date2 rows
final = pd.DataFrame(data={"date1": blank.iloc[:-1, :].index, "date2": blank.iloc[1:, :].index, "reduction": blank["r1"].iloc[:-1].fillna(5).values})
Output:
date1 date2 reduction
0 2016-01-01 2016-01-02 7.0
1 2016-01-02 2016-01-03 5.0
2 2016-01-03 2016-01-04 7.0
3 2016-01-04 2016-01-05 7.0
4 2016-01-05 2016-01-06 5.0
5 2016-01-06 2016-01-07 3.0
6 2016-01-07 2016-01-08 2.0
7 2016-01-08 2016-01-09 9.0
8 2016-01-09 2016-01-10 3.0
9 2016-01-10 2016-01-11 8.0
10 2016-01-11 2016-01-12 3.0
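Side note (not part of the original answer): using the df from the question's own read_csv snippet (with date1/date2 parsed as datetimes), a shorter sketch is to apply the longer ranges first so that the narrower inner ranges overwrite them:

days = pd.date_range(df.date1.min(), df.date2.max() - pd.Timedelta(days=1))
out = pd.Series(0.0, index=days)
# longest spans first, so the shorter (inner) ranges win regardless of row order
for r in df.assign(span=df.date2 - df.date1).sort_values('span', ascending=False).itertuples():
    out[(out.index >= r.date1) & (out.index < r.date2)] = r.reduction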
I have the following dataframe with multi-index:
import numpy as np
import pandas as pd

dates = pd.date_range(start='2016-01-01 09:30:00', periods=20, freq='s')
df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.concatenate((dates, dates)),
                   'C': np.arange(40)})
df = df.set_index(["B", "A"])
Now I want to create a new column containing the maximum of the last two values within each group of index level A. I tried the following:
df.loc[:,"D"] = df.groupby(level="A").rolling(2).max()
But it only produces NaN for the new column ("D"), since the index of the grouped dataframe does not align with the index of the original dataframe.
How can I solve this? I prefer to stay away from stacking/unstacking, swaplevel/sortlevel, join, or concat, since I have a big dataframe and these operations tend to be quite time-consuming.
You need reset_index with the drop parameter to remove the first level of the MultiIndex:
df['D'] = df.groupby(level="A")['C'].rolling(2).max().reset_index(level=0, drop=True)
print (df)
C D
B A
2016-01-01 09:30:00 1 0 NaN
2016-01-01 09:30:01 1 1 1.0
2016-01-01 09:30:02 1 2 2.0
2016-01-01 09:30:03 1 3 3.0
2016-01-01 09:30:04 1 4 4.0
2016-01-01 09:30:05 1 5 5.0
2016-01-01 09:30:06 1 6 6.0
2016-01-01 09:30:07 1 7 7.0
2016-01-01 09:30:08 1 8 8.0
2016-01-01 09:30:09 1 9 9.0
2016-01-01 09:30:10 1 10 10.0
2016-01-01 09:30:11 1 11 11.0
2016-01-01 09:30:12 1 12 12.0
2016-01-01 09:30:13 1 13 13.0
2016-01-01 09:30:14 1 14 14.0
2016-01-01 09:30:15 1 15 15.0
2016-01-01 09:30:16 1 16 16.0
2016-01-01 09:30:17 1 17 17.0
2016-01-01 09:30:18 1 18 18.0
2016-01-01 09:30:19 1 19 19.0
2016-01-01 09:30:00 2 20 NaN
2016-01-01 09:30:01 2 21 21.0
...
...
because:
print (df.groupby(level="A")['C'].rolling(2).max())
A B A
1 2016-01-01 09:30:00 1 NaN
2016-01-01 09:30:01 1 1.0
2016-01-01 09:30:02 1 2.0
2016-01-01 09:30:03 1 3.0
2016-01-01 09:30:04 1 4.0
2016-01-01 09:30:05 1 5.0
2016-01-01 09:30:06 1 6.0
2016-01-01 09:30:07 1 7.0
2016-01-01 09:30:08 1 8.0
2016-01-01 09:30:09 1 9.0
2016-01-01 09:30:10 1 10.0
2016-01-01 09:30:11 1 11.0
2016-01-01 09:30:12 1 12.0
2016-01-01 09:30:13 1 13.0
2016-01-01 09:30:14 1 14.0
2016-01-01 09:30:15 1 15.0
2016-01-01 09:30:16 1 16.0
2016-01-01 09:30:17 1 17.0
2016-01-01 09:30:18 1 18.0
2016-01-01 09:30:19 1 19.0
2 2016-01-01 09:30:00 2 NaN
2016-01-01 09:30:01 2 21.0
...
...
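As a side note (not part of the original answer), a transform-based variant keeps the result aligned to the original MultiIndex, so no reset_index is needed; a minimal sketch:

# transform returns a series aligned to df's original index
df['D'] = df.groupby(level="A")['C'].transform(lambda x: x.rolling(2).max())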
I have a pandas Series that consists of numbers either 0 or 1.
2016-01-01 0
2016-01-02 1
2016-01-03 1
2016-01-04 0
2016-01-05 1
2016-01-06 1
2016-01-08 1
...
I want to make a dataframe from this Series, adding another column that counts how many 1s occurred over a given period of time.
For example, if the period was 5 days, then the dataframe would look like
Value 1s_for_the_last_5days
2016-01-01 0
2016-01-02 1
2016-01-03 1
2016-01-04 0
2016-01-05 1 3
2016-01-06 1 4
2016-01-08 1 4
...
In addition, I'd like to know whether I can count the number of non-zero rows over a given range in a situation like the one below.
Value Not_0_rows_for_the_last_5days
2016-01-01 0
2016-01-02 1.1
2016-01-03 0.4
2016-01-04 0
2016-01-05 0.6 3
2016-01-06 0.2 4
2016-01-08 10 4
Thank you for reading this. I would appreciate it if you could give me any solutions or hints on the problem.
You can use rolling for this, which creates a fixed-size window and slides it over your column while applying an aggregation such as sum.
First create some dummy data:
import pandas as pd
import numpy as np
ser = pd.Series(np.random.randint(0, 2, size=10),
                index=pd.date_range("2016-01-01", periods=10),
                name="Value")
print(ser)
2016-01-01 1
2016-01-02 0
2016-01-03 0
2016-01-04 0
2016-01-05 0
2016-01-06 0
2016-01-07 0
2016-01-08 0
2016-01-09 1
2016-01-10 0
Freq: D, Name: Value, dtype: int64
Now, use rolling:
summed = ser.rolling(5).sum()
print(summed)
2016-01-01 NaN
2016-01-02 NaN
2016-01-03 NaN
2016-01-04 NaN
2016-01-05 1.0
2016-01-06 0.0
2016-01-07 0.0
2016-01-08 0.0
2016-01-09 1.0
2016-01-10 1.0
Freq: D, Name: Value, dtype: float64
Finally, create the resulting data frame:
df = pd.DataFrame({"Value": ser, "Summed": summed})
print(df)
Summed Value
2016-01-01 NaN 1
2016-01-02 NaN 0
2016-01-03 NaN 0
2016-01-04 NaN 0
2016-01-05 1.0 0
2016-01-06 0.0 0
2016-01-07 0.0 0
2016-01-08 0.0 0
2016-01-09 1.0 1
2016-01-10 1.0 0
In order to count arbitrary values, define your own aggregation function and use it with apply on the rolling window:
# dummy function to count zeros
count_func = lambda x: (x==0).sum()
summed = ser.rolling(5).apply(count_func)
print(summed)
You may replace 0 with any value or combination of values of your original series.
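For instance, a minimal sketch of the second part of the question (counting the non-zero rows of the last five rows), reusing the ser and df defined above; the column name just mirrors the question's example:

# count the entries that are not zero in each 5-row window
count_nonzero = lambda x: (x != 0).sum()
df["Not_0_rows_for_the_last_5days"] = ser.rolling(5).apply(count_nonzero)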
pd.Series.rolling is a useful method, but you can also do this in plain Python:
def rolling_count(l, rolling_num=5, include_same_day=True):
    output_list = []
    for index, _ in enumerate(l):
        # window of `rolling_num` rows ending at (and including) the current row
        start = index - rolling_num + int(include_same_day)
        end = index + int(include_same_day)
        if start < 0:
            start = 0
        output_list.append(sum(l[start:end]))
    return output_list

data = {'Value': [0, 1, 1, 0, 1, 1, 1],
        'date': ['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04', '2016-01-05', '2016-01-06', '2016-01-08']}
df = pd.DataFrame(data).set_index('date')
df['1s_for_the_last_5days'] = rolling_count(df['Value'], rolling_num=5)
print(df)
Output:
Value 1s_for_the_last_5days
date
2016-01-01 0 0
2016-01-02 1 1
2016-01-03 1 2
2016-01-04 0 2
2016-01-05 1 3
2016-01-06 1 4
2016-01-08 1 4
You want rolling:
s.rolling('5D').sum()
df = pd.DataFrame({'Value': s, '1s_for_the_last_5days': s.rolling('5D').sum()})
My raw data looks like the following:
start_date end_date value
0 2016-01-01 2016-01-03 2
1 2016-01-05 2016-01-08 4
The interpretation is that the data takes a value of 2 between 1/1/2016 and 1/3/2016, and it takes a value of 4 between 1/5/2016 and 1/8/2016. I want to transform the raw data to a daily time series like the following:
2016-01-01 2
2016-01-02 2
2016-01-03 2
2016-01-04 0
2016-01-05 4
2016-01-06 4
2016-01-07 4
2016-01-08 4
Note that if a date in the time series doesn't appear between the start_date and end_date in any row of the raw data, it gets a value of 0 in the time series.
I can create the time series by looping through the raw data, but that's slow. Is there a faster way to do it?
You may try this:
In [120]: df
Out[120]:
start_date end_date value
0 2016-01-01 2016-01-03 2
1 2016-01-05 2016-01-08 4
In [121]: new = pd.DataFrame({'dt': pd.date_range(df.start_date.min(), df.end_date.max())})
In [122]: new
Out[122]:
dt
0 2016-01-01
1 2016-01-02
2 2016-01-03
3 2016-01-04
4 2016-01-05
5 2016-01-06
6 2016-01-07
7 2016-01-08
In [123]: new = new.merge(df, how='left', left_on='dt', right_on='start_date').fillna(method='pad')
In [124]: new
Out[124]:
dt start_date end_date value
0 2016-01-01 2016-01-01 2016-01-03 2.0
1 2016-01-02 2016-01-01 2016-01-03 2.0
2 2016-01-03 2016-01-01 2016-01-03 2.0
3 2016-01-04 2016-01-01 2016-01-03 2.0
4 2016-01-05 2016-01-05 2016-01-08 4.0
5 2016-01-06 2016-01-05 2016-01-08 4.0
6 2016-01-07 2016-01-05 2016-01-08 4.0
7 2016-01-08 2016-01-05 2016-01-08 4.0
In [125]: new.loc[(new.dt < new.start_date) | (new.dt > new.end_date), 'value'] = 0
In [126]: new[['dt', 'value']]
Out[126]:
dt value
0 2016-01-01 2.0
1 2016-01-02 2.0
2 2016-01-03 2.0
3 2016-01-04 0.0
4 2016-01-05 4.0
5 2016-01-06 4.0
6 2016-01-07 4.0
7 2016-01-08 4.0
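An alternative sketch (not from the original answer), assuming start_date and end_date are already datetimes and the ranges do not overlap: expand each range to one row per day and reindex onto the full calendar, filling uncovered days with 0.

# one row per covered day, carrying that range's value
daily = pd.concat(pd.Series(r.value, index=pd.date_range(r.start_date, r.end_date))
                  for r in df.itertuples())
# reindex onto every day of the overall span; days not covered by any range get 0
ts = daily.reindex(pd.date_range(df.start_date.min(), df.end_date.max()), fill_value=0)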
I have such a DataFrame:
A
2016-01-01 00:00:00 0
2016-01-01 12:00:00 1
2016-01-02 00:00:00 2
2016-01-02 12:00:00 3
2016-01-03 00:00:00 4
2016-01-03 12:00:00 5
2016-01-04 00:00:00 6
2016-01-04 12:00:00 7
2016-01-05 00:00:00 8
2016-01-05 12:00:00 9
The reason I set apart 2016-01-02 00:00:00 through 2016-01-03 12:00:00 is that those two days are a weekend.
So here is what I wish to do:
I wish to apply rolling_sum with a window of 2 business days.
For example, I wish to sum
A
2016-01-04 00:00:00 6
2016-01-04 12:00:00 7
2016-01-05 00:00:00 8
2016-01-05 12:00:00 9
and then sum (we skip any non-business days)
A
2016-01-01 00:00:00 0
2016-01-01 12:00:00 1
2016-01-04 00:00:00 6
2016-01-04 12:00:00 7
And the result is
A
2016-01-01 NaN
2016-01-04 14
2016-01-05 30
How can I achieve that?
I tried rolling_sum(df, window=2, freq=BDay(1)), but it seems to just pick one row per day rather than summing the two rows (00:00 and 12:00) within the same day.
You could first select only business days, resample to (business) daily frequency for the remaining data points and sum, and then apply rolling_sum:
Starting with some sample data:
import numpy as np
import pandas as pd
from datetime import datetime

df = pd.DataFrame(data={'A': np.random.randint(0, 10, 500)},
                  index=pd.date_range(datetime(2016,1,1), freq='6H', periods=500))
A
2016-01-01 00:00:00 6
2016-01-01 06:00:00 9
2016-01-01 12:00:00 3
2016-01-01 18:00:00 9
2016-01-02 00:00:00 7
2016-01-02 06:00:00 5
2016-01-02 12:00:00 8
2016-01-02 18:00:00 6
2016-01-03 00:00:00 2
2016-01-03 06:00:00 0
2016-01-03 12:00:00 0
2016-01-03 18:00:00 0
2016-01-04 00:00:00 5
2016-01-04 06:00:00 4
2016-01-04 12:00:00 1
2016-01-04 18:00:00 4
2016-01-05 00:00:00 6
2016-01-05 06:00:00 9
2016-01-05 12:00:00 7
2016-01-05 18:00:00 2
....
First select the values on business days:
tsdays = df.index.values.astype('<M8[D]')
bdays = pd.bdate_range(tsdays[0], tsdays[-1]).values.astype('<M8[D]')
df = df[np.in1d(tsdays, bdays)]
Then apply rolling_sum() to the resampled data, where each value represents the sum for an individual business day:
pd.rolling_sum(df.resample('B', how='sum'), window=2)
to get:
A
2016-01-01 NaN
2016-01-04 41
2016-01-05 38
2016-01-06 56
2016-01-07 52
2016-01-08 37
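On current pandas, where pd.rolling_sum and the how= argument of resample have been removed, the same result should come from chaining the methods:

df.resample('B').sum().rolling(window=2).sum()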
See also [here] for the type conversion and [this question] for the business day extraction.