Merge daily values into intraday DataFrame - python

Suppose I have two DataFrames: intraday which has one row per minute, and daily which has one row per day.
How can I add a column intraday['some_val'], where some_val is taken from the daily['some_val'] row whose index date matches the date component of the intraday.index value?

Given the following setup,
import pandas as pd
import numpy as np

intraday = pd.DataFrame(index=pd.date_range('2016-01-01', '2016-01-07', freq='T'))
daily = pd.DataFrame(index=pd.date_range('2016-01-01', '2016-01-07', freq='D'))
daily['some_val'] = np.arange(daily.shape[0])
you can create a column from the date component of both indices, and merge on that column
daily['date'] = daily.index.date
intraday['date'] = intraday.index.date
daily.merge(intraday)
date some_val
0 2016-01-01 0
1 2016-01-01 0
2 2016-01-01 0
3 2016-01-01 0
4 2016-01-01 0
... ... ...
8636 2016-01-06 5
8637 2016-01-06 5
8638 2016-01-06 5
8639 2016-01-06 5
8640 2016-01-07 6
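Note that merge returns a frame with a fresh integer index, so the minute-level timestamps are lost. A small sketch (not part of the original answer) that restores them:
merged = intraday.reset_index().merge(daily, on='date').set_index('index')
merged.index.name = None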
Alternatively, you can take advantage of automatic index alignment, and use fillna.
intraday['some_val'] = daily['some_val']
intraday.fillna(method='ffill', downcast='infer')
some_val
2016-01-01 00:00:00 0
2016-01-01 00:01:00 0
2016-01-01 00:02:00 0
2016-01-01 00:03:00 0
2016-01-01 00:04:00 0
... ...
2016-01-06 23:56:00 5
2016-01-06 23:57:00 5
2016-01-06 23:58:00 5
2016-01-06 23:59:00 5
2016-01-07 00:00:00 6
Note that this only works if the time component of your daily index is 00:00.
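If the time component of the daily index is not 00:00, one workaround (a sketch, not from the answer above) is to normalize both indices to midnight before aligning:
lookup = pd.Series(daily['some_val'].values, index=daily.index.normalize())
intraday['some_val'] = lookup.reindex(intraday.index.normalize()).values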

Related

Transform random time intervals to a structured 30-minute interval

I have this DataFrame recording the time periods in which some tasks ran:
Date Start Time End Time
0 2016-01-01 0:00:00 2016-01-01 0:10:00 2016-01-01 0:25:00
1 2016-01-01 0:00:00 2016-01-01 1:17:00 2016-01-01 1:31:00
2 2016-01-02 0:00:00 2016-01-02 0:30:00 2016-01-02 0:32:00
... ... ... ...
I want to convert this df to 30-minute intervals.
Expected outcome:
Date Hours
1 2016-01-01 0:30:00 0:15
2 2016-01-01 1:00:00 0:00
3 2016-01-01 1:30:00 0:13
4 2016-01-01 2:00:00 0:01
5 2016-01-01 2:30:00 0:00
6 2016-01-01 3:00:00 0:00
... ...
47 2016-01-01 23:30:00 0:00
48 2016-01-02 00:00:00 0:00
49 2016-01-02 00:30:00 0:00
50 2016-01-02 01:00:00 0:02
... ...
I was trying to do this with a for loop, which was getting tedious. Is there a simple way to do it in pandas?
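For reference, the sample frame can be rebuilt like this (the exact values are an assumption based on the rows shown above):
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(["2016-01-01", "2016-01-01", "2016-01-02"]),
    "Start Time": pd.to_datetime(["2016-01-01 00:10", "2016-01-01 01:17", "2016-01-02 00:30"]),
    "End Time": pd.to_datetime(["2016-01-01 00:25", "2016-01-01 01:31", "2016-01-02 00:32"]),
})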
IIUC you can discard the Date column, get the time difference between start and end, group by 30-minute bins with pd.Grouper, and aggregate with "first" (assuming you always have at most one entry per 30-minute slot):
print(df.assign(Diff=df["End Time"] - df["Start Time"])
        .groupby(pd.Grouper(key="Start Time", freq="30T"))
        .agg({"Diff": "first"})
        .fillna(pd.Timedelta(seconds=0)))
Diff
Start Time
2016-01-01 00:00:00 0 days 00:15:00
2016-01-01 00:30:00 0 days 00:00:00
2016-01-01 01:00:00 0 days 00:14:00
2016-01-01 01:30:00 0 days 00:00:00
2016-01-01 02:00:00 0 days 00:00:00
2016-01-01 02:30:00 0 days 00:00:00
...
2016-01-02 00:30:00 0 days 00:02:00
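If a 30-minute slot can contain several tasks, swapping "first" for "sum" aggregates their durations instead of keeping only the first (a sketch under the same assumptions; it still books each task's full duration into the slot of its start time, whereas the minute-level approach below splits durations across slots):
print(df.assign(Diff=df["End Time"] - df["Start Time"])
        .groupby(pd.Grouper(key="Start Time", freq="30T"))["Diff"]
        .sum()
        .fillna(pd.Timedelta(seconds=0)))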
The idea is to create a series of zeros with a per-minute DatetimeIndex running from the minimum Start Time to the maximum End Time, then add 1 at each Start Time and subtract 1 at each End Time. A cumsum then marks the minutes between Start and End, resample(...).sum() aggregates them per 30 minutes, and reset_index restores the dates as a column. The last line of code only puts the Hours column in the desired format.
# create a series of 0s with a per-minute datetime index
res = pd.Series(data=0,
                index=pd.DatetimeIndex(pd.date_range(df['Start Time'].min(),
                                                     df['End Time'].max(),
                                                     freq='T'),
                                       name='Dates'),
                name='Hours')
# add 1 at each start time and subtract 1 at each end time
res[df['Start Time']] += 1
res[df['End Time']] -= 1
# cumsum to mark the active minutes, then resample per 30 minutes
res = (res.cumsum()
          .resample('30T', label='right').sum()
          .reset_index('Dates'))
# change the format of the Hours column (honestly not necessary)
res['Hours'] = pd.to_datetime(res['Hours'], format='%M').dt.strftime('%H:%M')  # or .dt.time
print(res)
Dates Hours
0 2016-01-01 00:30:00 00:15
1 2016-01-01 01:00:00 00:00
2 2016-01-01 01:30:00 00:13
3 2016-01-01 02:00:00 00:01
4 2016-01-01 02:30:00 00:00
5 2016-01-01 03:00:00 00:00
...
48 2016-01-02 00:30:00 00:00
49 2016-01-02 01:00:00 00:02
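One caveat not handled above: if two tasks start or end in the same minute, the label-based res[df['Start Time']] += 1 will not accumulate both increments. A sketch that tallies duplicates first, replacing the two increment lines:
starts = df['Start Time'].value_counts()
ends = df['End Time'].value_counts()
res = res.add(starts, fill_value=0).sub(ends, fill_value=0)
The cumsum/resample steps then proceed unchanged.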

How do I fill in missing dates with zeros for a pandas groupby list?

I am looking to take a daily record of transactions and account for days when 0 transactions occurred.
Here is my initial dataframe:
df.head()
tr_timestamp text location
2016-01-01 cookies TX
2016-01-01 pizza TX
2016-01-04 apples TX
2016-01-08 bread TX
When I run a group by day, I get the following:
df_by_day = df['tr_timestamp'].groupby(df.tr_timestamp).count()
df_by_day
tr_timestamp
2016-01-01 2
2016-01-04 1
2016-01-08 1
I'm looking to use Python/Pandas where dates without a transaction are filled such that I get the following output:
df_by_day_filled
tr_timestamp
2016-01-01 2
2016-01-02 0
2016-01-03 0
2016-01-04 1
2016-01-05 0
2016-01-06 0
2016-01-07 0
2016-01-08 1
I've tried the following answers, which don't quite give the output I need:
Pandas groupby for zero values
Fill Missing Dates in DataFrame with Duplicate Dates in Groupby
Thanks.
You can also try:
df_by_day.asfreq('D', fill_value=0)
Output:
tr_timestamp
2016-01-01 2
2016-01-02 0
2016-01-03 0
2016-01-04 1
2016-01-05 0
2016-01-06 0
2016-01-07 0
2016-01-08 1
Freq: D, Name: tr_timestamp, dtype: int64
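Note that asfreq needs a DatetimeIndex, so if tr_timestamp is stored as strings, convert it before grouping (a sketch):
df['tr_timestamp'] = pd.to_datetime(df['tr_timestamp'])
df_by_day = df.groupby('tr_timestamp')['text'].count()
df_by_day.asfreq('D', fill_value=0)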
This is a resample operation:
df.set_index(pd.to_datetime(df.pop('tr_timestamp'))).resample('D')['text'].count()
tr_timestamp
2016-01-01 2
2016-01-02 0
2016-01-03 0
2016-01-04 1
2016-01-05 0
2016-01-06 0
2016-01-07 0
2016-01-08 1
Freq: D, Name: text, dtype: int64
The pd.to_datetime call ensures this works if "tr_timestamp" is not a datetime. If it is, then the solution simplifies to
df.dtypes
tr_timestamp datetime64[ns]
text object
location object
dtype: object
df.set_index('tr_timestamp').resample('D')['text'].count()
tr_timestamp
2016-01-01 2
2016-01-02 0
2016-01-03 0
2016-01-04 1
2016-01-05 0
2016-01-06 0
2016-01-07 0
2016-01-08 1
Freq: D, Name: text, dtype: int64
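A third route (a sketch, assuming the grouped result already has a DatetimeIndex; apply pd.to_datetime first if not) is to reindex over the full daily range:
full_range = pd.date_range(df_by_day.index.min(), df_by_day.index.max(), freq='D')
df_by_day.reindex(full_range, fill_value=0)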

How can I count the number of rows that are not zero in a certain range in python?

I have a pandas Series that consists of numbers either 0 or 1.
2016-01-01 0
2016-01-02 1
2016-01-03 1
2016-01-04 0
2016-01-05 1
2016-01-06 1
2016-01-08 1
...
I want to make a dataframe using this Series, adding another series that provides information on how many 1s exist for a certain period of time.
For example, if the period was 5 days, then the dataframe would look like
Value 1s_for_the_last_5days
2016-01-01 0
2016-01-02 1
2016-01-03 1
2016-01-04 0
2016-01-05 1 3
2016-01-06 1 4
2016-01-08 1 4
...
In addition, I'd like to know whether I can count the number of nonzero rows over a certain range, in a situation like the one below.
Value Not_0_rows_for_the_last_5days
2016-01-01 0
2016-01-02 1.1
2016-01-03 0.4
2016-01-04 0
2016-01-05 0.6 3
2016-01-06 0.2 4
2016-01-08 10 4
Thank you for reading this. I would appreciate it if you could give me any solutions or hints on the problem.
You can use rolling for this, which creates a fixed-size window and slides it over the given column while applying an aggregation such as sum.
First create some dummy data:
import pandas as pd
import numpy as np
ser = pd.Series(np.random.randint(0, 2, size=10),
                index=pd.date_range("2016-01-01", periods=10),
                name="Value")
print(ser)
2016-01-01 1
2016-01-02 0
2016-01-03 0
2016-01-04 0
2016-01-05 0
2016-01-06 0
2016-01-07 0
2016-01-08 0
2016-01-09 1
2016-01-10 0
Freq: D, Name: Value, dtype: int64
Now, use rolling:
summed = ser.rolling(5).sum()
print(summed)
2016-01-01 NaN
2016-01-02 NaN
2016-01-03 NaN
2016-01-04 NaN
2016-01-05 1.0
2016-01-06 0.0
2016-01-07 0.0
2016-01-08 0.0
2016-01-09 1.0
2016-01-10 1.0
Freq: D, Name: Value, dtype: float64
Finally, create the resulting data frame:
df = pd.DataFrame({"Value": ser, "Summed": summed})
print(df)
Summed Value
2016-01-01 NaN 1
2016-01-02 NaN 0
2016-01-03 NaN 0
2016-01-04 NaN 0
2016-01-05 1.0 0
2016-01-06 0.0 0
2016-01-07 0.0 0
2016-01-08 0.0 0
2016-01-09 1.0 1
2016-01-10 1.0 0
In order to count arbitrary values, define your own aggregation function in conjunction with apply on the rolling window like:
# dummy function to count zeros
count_func = lambda x: (x==0).sum()
summed = ser.rolling(5).apply(count_func)
print(summed)
You may replace 0 with any value or combination of values of your original series.
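Applied to the second part of the question, counting the nonzero rows in the trailing window looks like this (a sketch; raw=True just passes plain numpy arrays to the function):
count_nonzero = lambda x: (x != 0).sum()
summed = ser.rolling(5).apply(count_nonzero, raw=True)
print(summed)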
pd.Series.rolling is a useful method, but you can also do this in plain Python:
def rolling_count(l, rolling_num=5, include_same_day=True):
    output_list = []
    for index, _ in enumerate(l):
        start = index - rolling_num + int(include_same_day)
        end = index + int(include_same_day)
        if start < 0:
            start = 0
        output_list.append(sum(l[start:end]))
    return output_list

data = {'Value': [0, 1, 1, 0, 1, 1, 1],
        'date': ['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
                 '2016-01-05', '2016-01-06', '2016-01-08']}
df = pd.DataFrame(data).set_index('date')
df['1s_for_the_last_5days'] = rolling_count(df['Value'].tolist(), rolling_num=5)
print(df)
Output:
Value 1s_for_the_last_5days
date
2016-01-01 0 0
2016-01-02 1 1
2016-01-03 1 2
2016-01-04 0 2
2016-01-05 1 3
2016-01-06 1 4
2016-01-08 1 4
You want rolling with a time-based offset window:
s.rolling('5D').sum()
df = pd.DataFrame({'Value': s, '1s_for_the_last_5days': s.rolling('5D').sum()})
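For reference, a sketch with s built from the question's data: the '5D' offset window counts by calendar days, so the missing 2016-01-07 is handled without any reindexing.
idx = pd.to_datetime(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
                      '2016-01-05', '2016-01-06', '2016-01-08'])
s = pd.Series([0, 1, 1, 0, 1, 1, 1], index=idx)
s.rolling('5D').sum()  # 2016-01-08 -> 3.0, since only 01-04 through 01-08 fall in the window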

Convert Date Ranges to Time Series in Pandas

My raw data looks like the following:
start_date end_date value
0 2016-01-01 2016-01-03 2
1 2016-01-05 2016-01-08 4
The interpretation is that the data takes a value of 2 between 1/1/2016 and 1/3/2016, and it takes a value of 4 between 1/5/2016 and 1/8/2016. I want to transform the raw data to a daily time series like the following:
2016-01-01 2
2016-01-02 2
2016-01-03 2
2016-01-04 0
2016-01-05 4
2016-01-06 4
2016-01-07 4
2016-01-08 4
Note that if a date in the time series doesn't appear between the start_date and end_date in any row of the raw data, it gets a value of 0 in the time series.
I can create the time series by looping through the raw data, but that's slow. Is there a faster way to do it?
You may try this:
In [120]: df
Out[120]:
start_date end_date value
0 2016-01-01 2016-01-03 2
1 2016-01-05 2016-01-08 4
In [121]: new = pd.DataFrame({'dt': pd.date_range(df.start_date.min(), df.end_date.max())})
In [122]: new
Out[122]:
dt
0 2016-01-01
1 2016-01-02
2 2016-01-03
3 2016-01-04
4 2016-01-05
5 2016-01-06
6 2016-01-07
7 2016-01-08
In [123]: new = new.merge(df, how='left', left_on='dt', right_on='start_date').fillna(method='pad')
In [124]: new
Out[124]:
dt start_date end_date value
0 2016-01-01 2016-01-01 2016-01-03 2.0
1 2016-01-02 2016-01-01 2016-01-03 2.0
2 2016-01-03 2016-01-01 2016-01-03 2.0
3 2016-01-04 2016-01-01 2016-01-03 2.0
4 2016-01-05 2016-01-05 2016-01-08 4.0
5 2016-01-06 2016-01-05 2016-01-08 4.0
6 2016-01-07 2016-01-05 2016-01-08 4.0
7 2016-01-08 2016-01-05 2016-01-08 4.0
In [125]: new.loc[(new.dt < new.start_date) | (new.dt > new.end_date), 'value'] = 0
In [126]: new[['dt', 'value']]
Out[126]:
dt value
0 2016-01-01 2.0
1 2016-01-02 2.0
2 2016-01-03 2.0
3 2016-01-04 0.0
4 2016-01-05 4.0
5 2016-01-06 4.0
6 2016-01-07 4.0
7 2016-01-08 4.0
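An alternative sketch that avoids the merge entirely (assuming start_date and end_date are datetimes and the ranges do not overlap): expand each row into its own daily range, then reindex over the full span with a fill of 0.
s = pd.concat([pd.Series(r.value, index=pd.date_range(r.start_date, r.end_date))
               for r in df.itertuples()])
s.reindex(pd.date_range(df.start_date.min(), df.end_date.max()), fill_value=0)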

rolling_sum on business day and return new dataframe with date as index

I have such a DataFrame:
A
2016-01-01 00:00:00 0
2016-01-01 12:00:00 1
2016-01-02 00:00:00 2
2016-01-02 12:00:00 3
2016-01-03 00:00:00 4
2016-01-03 12:00:00 5
2016-01-04 00:00:00 6
2016-01-04 12:00:00 7
2016-01-05 00:00:00 8
2016-01-05 12:00:00 9
The reason I call out 2016-01-02 00:00:00 through 2016-01-03 12:00:00 is that those two days fall on a weekend.
So here is what I wish to do:
I wish to rolling_sum with window = 2 business days.
For example, I wish to sum
A
2016-01-04 00:00:00 6
2016-01-04 12:00:00 7
2016-01-05 00:00:00 8
2016-01-05 12:00:00 9
and then sum (we skip any non-business days)
A
2016-01-01 00:00:00 0
2016-01-01 12:00:00 1
2016-01-04 00:00:00 6
2016-01-04 12:00:00 7
And the result is
A
2016-01-01 NaN
2016-01-04 14
2016-01-05 30
How can I achieve that?
I tried rolling_sum(df, window=2, freq=BDay(1)); it seems to just pick one row per day rather than summing the two rows (00:00 and 12:00) within the same day.
You could first select only business days, resample the remaining data points to business-day frequency with a sum, and then apply a rolling sum:
Starting with some sample data:
df = pd.DataFrame(data={'A': np.random.randint(0, 10, 500)},
                  index=pd.date_range('2016-01-01', freq='6H', periods=500))
A
2016-01-01 00:00:00 6
2016-01-01 06:00:00 9
2016-01-01 12:00:00 3
2016-01-01 18:00:00 9
2016-01-02 00:00:00 7
2016-01-02 06:00:00 5
2016-01-02 12:00:00 8
2016-01-02 18:00:00 6
2016-01-03 00:00:00 2
2016-01-03 06:00:00 0
2016-01-03 12:00:00 0
2016-01-03 18:00:00 0
2016-01-04 00:00:00 5
2016-01-04 06:00:00 4
2016-01-04 12:00:00 1
2016-01-04 18:00:00 4
2016-01-05 00:00:00 6
2016-01-05 06:00:00 9
2016-01-05 12:00:00 7
2016-01-05 18:00:00 2
....
First select the values on business days:
tsdays = df.index.values.astype('<M8[D]')
bdays = pd.bdate_range(tsdays[0], tsdays[-1]).values.astype('<M8[D]')
df = df[np.in1d(tsdays, bdays)]
Then apply a rolling sum to the resampled data, where each value represents the sum for an individual business day (pd.rolling_sum and resample's how= argument are gone in modern pandas, so the chained form is used):
df.resample('B').sum().rolling(window=2).sum()
to get:
A
2016-01-01 NaN
2016-01-04 41
2016-01-05 38
2016-01-06 56
2016-01-07 52
2016-01-08 37
See also [here] for the type conversion and [this question] for the business day extraction.
