Combine Dataframes with Minute and Date Indexes

Say I have the following dataframes: one with data at 1-minute intervals, and another with daily data.
>>> df_1 = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9]},
index=pd.to_datetime(['2017-01-01 00:01:00', '2017-01-01 00:02:00', '2017-01-02 00:01:00']))
>>> df_1
A B C
2017-01-01 00:01:00 1 4 7
2017-01-01 00:02:00 2 5 8
2017-01-02 00:01:00 3 6 9
>>> df_2 = pd.DataFrame({'D':['ON','OFF']},
index=pd.to_datetime(['2017-01-01', '2017-01-02']))
>>> df_2
D
2017-01-01 ON
2017-01-02 OFF
And I want to merge them so that the data from df_2 gets pulled into df_1 by the date only, as below.
>>> df_merge = pd.merge(df_1, df_2)
>>> df_merge
A B C D
2017-01-01 00:01:00 1 4 7 ON
2017-01-01 00:02:00 2 5 8 ON
2017-01-02 00:01:00 3 6 9 OFF
Is there any way to merge based on date only that will pull the daily data into the minutely data?

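With reindex (use method='ffill' so each row takes the value from the start of its own day; method='nearest' would map timestamps after noon to the following day):
df_1['D'] = df_2['D'].reindex(df_1.index, method='ffill')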
df_1
Out[184]:
A B C D
2017-01-01 00:01:00 1 4 7 ON
2017-01-01 00:02:00 2 5 8 ON
2017-01-02 00:01:00 3 6 9 OFF
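
The choice of fill method matters once the minute data extends past noon. The following is a minimal sketch, assuming a daily frame shaped like df_2 and two made-up timestamps (the names daily and minutes are hypothetical), that compares method='nearest' with method='ffill':

```python
import pandas as pd

# Daily flags indexed at midnight of each day (same shape as df_2 above).
daily = pd.DataFrame({'D': ['ON', 'OFF']},
                     index=pd.to_datetime(['2017-01-01', '2017-01-02']))

# Hypothetical minute-level timestamps: one early in the day, one after noon.
minutes = pd.to_datetime(['2017-01-01 00:01:00', '2017-01-01 18:00:00'])

# 'nearest' picks the closest midnight; 'ffill' picks the most recent one.
nearest = daily['D'].reindex(minutes, method='nearest')
ffill = daily['D'].reindex(minutes, method='ffill')

print(nearest.tolist())  # ['ON', 'OFF']: 18:00 is closer to the next midnight
print(ffill.tolist())    # ['ON', 'ON']: each row takes the value of its own day
```

Both methods require the daily index to be sorted.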

Option 1
If your dataframes are sorted by index, then pd.merge_asof should also work:
pd.merge_asof(df_1, df_2, left_index=True, right_index=True)
A B C D
2017-01-01 00:01:00 1 4 7 ON
2017-01-01 00:02:00 2 5 8 ON
2017-01-02 00:01:00 3 6 9 OFF
Option 2
If not, then you can construct a general solution using merge on a temporary column constructed from the floored date of df_1's index:
df_1.assign(temp=df_1.index.floor('D')).merge(
df_2, left_on='temp', right_index=True
).drop(columns='temp')
A B C D
2017-01-01 00:01:00 1 4 7 ON
2017-01-01 00:02:00 2 5 8 ON
2017-01-02 00:01:00 3 6 9 OFF
Option 3
Using concat:
idx = df_1.index
pd.concat([df_1.set_index(df_1.index.floor('D')), df_2], axis=1).set_index(idx)
A B C D
2017-01-01 00:01:00 1 4 7 ON
2017-01-01 00:02:00 2 5 8 ON
2017-01-02 00:01:00 3 6 9 OFF
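
Another way to express the same date-only lookup, sketched here under the assumption that df_2 has exactly one row per day, is to normalize the minute index to midnight and look those keys up in df_2 directly. The names day_key and df_merge are illustrative only:

```python
import pandas as pd

df_1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]},
                    index=pd.to_datetime(['2017-01-01 00:01:00',
                                          '2017-01-01 00:02:00',
                                          '2017-01-02 00:01:00']))
df_2 = pd.DataFrame({'D': ['ON', 'OFF']},
                    index=pd.to_datetime(['2017-01-01', '2017-01-02']))

# normalize() sets the time of day to midnight, which gives the lookup key.
day_key = df_1.index.normalize()

# Look up each key in df_2['D']; to_numpy() avoids index alignment on assignment.
df_merge = df_1.assign(D=df_2['D'].reindex(day_key).to_numpy())
print(df_merge)
```

This does not need either frame to be sorted, and days missing from df_2 come through as NaN.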

Related

pandas consecutive Boolean event rollup time series

Here's some made up time series data on 1 minute intervals:
import pandas as pd
import numpy as np
np.random.seed(5)  # seed NumPy's generator, since the data comes from np.random
rows,cols = 8760,3
data = np.random.rand(rows,cols)
tidx = pd.date_range('2019-01-01', periods=rows, freq='1min')
df = pd.DataFrame(data, columns=['condition1','condition2','condition3'], index=tidx)
This is just some code to create some Boolean columns
df['condition1_bool'] = df['condition1'].lt(.1)
df['condition2_bool'] = df['condition2'].lt(df['condition1']) & df['condition2'].gt(df['condition3'])
df['condition3_bool'] = df['condition3'].gt(.9)
df = df[['condition1_bool','condition2_bool','condition3_bool']]
df = df.astype(int)
On my screen this prints:
condition1_bool condition2_bool condition3_bool
2019-01-01 00:00:00 0 0 0
2019-01-01 00:01:00 0 0 1 <---- Count as same event!
2019-01-01 00:02:00 0 0 1 <---- Count as same event!
2019-01-01 00:03:00 1 0 0
2019-01-01 00:04:00 0 0 0
What I am trying to figure out is how to roll up cumulative events (True or 1) per hour, where consecutive 1s with no 0 between them count as the same event. That is what the <---- Count as same event! markers above are meant to show.
If I do:
df = df.resample('H').sum()
This will just resample and count every 1, right? It ignores the consecutive-run rule I was trying to highlight with <---- Count as same event!
Thanks for any tips!!

Check whether the current row (e.g. "2019-01-01 00:02:00") equals 1 and the previous row ("2019-01-01 00:01:00") does not. Only the first row of each run of consecutive 1s passes this test, so each run is counted once in the sum.
>>> df.resample('H').apply(lambda x: (x.eq(1) & x.shift().ne(1)).sum())
condition1_bool condition2_bool condition3_bool
2019-01-01 00:00:00 4 8 4
2019-01-01 01:00:00 9 7 6
2019-01-01 02:00:00 7 14 4
2019-01-01 03:00:00 2 8 7
2019-01-01 04:00:00 4 9 5
... ... ... ...
2019-01-06 21:00:00 4 8 2
2019-01-06 22:00:00 3 11 4
2019-01-06 23:00:00 6 11 4
2019-01-07 00:00:00 8 7 8
2019-01-07 01:00:00 4 9 6
[146 rows x 3 columns]
Using your code:
>>> df.resample('H').sum()
condition1_bool condition2_bool condition3_bool
2019-01-01 00:00:00 5 8 5
2019-01-01 01:00:00 9 8 6
2019-01-01 02:00:00 7 14 5
2019-01-01 03:00:00 2 9 7
2019-01-01 04:00:00 4 11 5
... ... ... ...
2019-01-06 21:00:00 5 11 3
2019-01-06 22:00:00 3 15 4
2019-01-06 23:00:00 6 12 4
2019-01-07 00:00:00 8 7 10
2019-01-07 01:00:00 4 9 7
[146 rows x 3 columns]
Check:
dti = pd.date_range('2021-11-15 21:00:00', '2021-11-15 22:00:00',
inclusive='left', freq='min')
df1 = pd.DataFrame({'c1': 1}, index=dti)
>>> df1.resample('H').apply(lambda x: (x.eq(1) & x.shift().ne(1)).sum())
c1
2021-11-15 21:00:00 1
>>> df1.resample('H').sum()
c1
2021-11-15 21:00:00 60
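
One thing worth checking is what happens when a run of 1s crosses an hour boundary. When the shift is applied inside each hourly group, the first row of the hour has no predecessor, so a run that continues from the previous hour is counted again. The sketch below uses a made-up series (the names s, starts, per_hour and per_hour_grouped are hypothetical) and marks event starts on the whole series before resampling:

```python
import pandas as pd

# Hypothetical 0/1 series: one event runs from 00:58 to 01:01, crossing the hour.
idx = pd.date_range('2019-01-01 00:00', periods=120, freq='min')
s = pd.Series(0, index=idx)
s.loc['2019-01-01 00:58':'2019-01-01 01:01'] = 1

# Mark event starts on the full series first, then count starts per hour.
starts = s.eq(1) & s.shift().ne(1)
per_hour = starts.astype(int).resample('60min').sum()

# For comparison: the shift applied separately inside each hourly group.
per_hour_grouped = s.resample('60min').apply(
    lambda x: (x.eq(1) & x.shift().ne(1)).sum())

print(per_hour.tolist())          # [1, 0]: the event is counted in its starting hour
print(per_hour_grouped.tolist())  # [1, 1]: the same event is counted in both hours
```

Which of the two counts is wanted depends on whether an event that spills into the next hour should belong to that hour as well.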

Time sequence in pandas dataframe

Below is a sample of the dataframe (df):
alpha value
0 a 5
1 a 8
2 a 4
3 b 2
4 b 1
I know how to make the sequence (numbers) as per the group:
df["serial"] = df.groupby("alpha").cumcount()+1
alpha value serial
0 a 5 1
1 a 8 2
2 a 4 3
3 b 2 1
4 b 1 2
But instead of numbers, I need a date-time sequence with a 30-minute interval.
Expected result:
alpha value serial
0 a 5 2021-01-01 23:30:00
1 a 8 2021-01-02 00:00:00
2 a 4 2021-01-02 00:30:00
3 b 2 2021-01-01 23:30:00
4 b 1 2021-01-02 00:00:00

You can simply multiply your result by a pd.Timedelta and add a base timestamp:
print ((df.groupby("alpha").cumcount()+1)*pd.Timedelta(minutes=30)+pd.Timestamp("2021-01-01 23:00:00"))
0 2021-01-01 23:30:00
1 2021-01-02 00:00:00
2 2021-01-02 00:30:00
3 2021-01-01 23:30:00
4 2021-01-02 00:00:00
dtype: datetime64[ns]

Try with to_datetime and groupby with cumcount, and then multiplying by pd.Timedelta for 30 minutes:
>>> df['serial'] = pd.to_datetime('2021-01-01 23:30:00') + df.groupby('alpha').cumcount() * pd.Timedelta(minutes=30)
>>> df
alpha value serial
0 a 5 2021-01-01 23:30:00
1 a 8 2021-01-02 00:00:00
2 a 4 2021-01-02 00:30:00
3 b 2 2021-01-01 23:30:00
4 b 1 2021-01-02 00:00:00
>>>
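
If each group needs its own starting time, the same idea extends with a lookup. This is only a sketch: the group_start dictionary and its timestamps are invented for illustration and are not part of the question.

```python
import pandas as pd

df = pd.DataFrame({'alpha': ['a', 'a', 'a', 'b', 'b'],
                   'value': [5, 8, 4, 2, 1]})

# Hypothetical per-group start times (assumed values, not from the question).
group_start = {'a': pd.Timestamp('2021-01-01 23:30:00'),
               'b': pd.Timestamp('2021-01-02 08:00:00')}
step = pd.Timedelta(minutes=30)  # spacing between consecutive rows in a group

# cumcount() numbers the rows 0, 1, 2, ... within each group.
offset = df.groupby('alpha').cumcount() * step
df['serial'] = df['alpha'].map(group_start) + offset
print(df)
```

With a single shared start time, the map call reduces to the constant timestamp used in the answers above.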

Stacking multiple dataframes together for different timestamp format into one timestamp

I have multiple data frames, each holding one day of data at 1-minute resolution (1 to 1440 minutes). The data frames all have the same columns and the same length. The time column values are in hhmm format.
Let's say df_A has the data for the first day, 2021-05-06. It looks like this:
>df_A
timestamp col1 col2..... col80
0
1
2
.
.
.
2359
The next day's data is in df_B, which has the same layout. Its date is 2021-05-07.
>df_B
timestamp col1 col2..... col80
0
1
2
.
.
.
2359
How could I stack these one under another into a single dataframe, identifying each row with a column whose values are in a format like YYYYMMDD HHmm? It would look somewhat like this:
>df
timestamp col1 col2..... col80
20210506 0000
20210506 0001
.
.
20210506 2359
20210507 0000
.
.
20210507 2359
How could I achieve this while dealing with multiple data frames at once?

import pandas as pd
df_A = pd.DataFrame(range(0, 10), columns=['timestamp'])
df_B = pd.DataFrame(range(0, 10), columns=['timestamp'])
# zero-pad hhmm to four digits, prepend the day, then parse
df_A['date'] = pd.to_datetime('2021-05-06 ' +
    df_A['timestamp'].astype(str).str.zfill(4), format='%Y-%m-%d %H%M')
df_B['date'] = pd.to_datetime('2021-05-07 ' +
    df_B['timestamp'].astype(str).str.zfill(4), format='%Y-%m-%d %H%M')
df_final = pd.concat([df_A, df_B])
df_final
timestamp date
0 0 2021-05-06 00:00:00
1 1 2021-05-06 00:01:00
2 2 2021-05-06 00:02:00
3 3 2021-05-06 00:03:00
4 4 2021-05-06 00:04:00
5 5 2021-05-06 00:05:00
6 6 2021-05-06 00:06:00
7 7 2021-05-06 00:07:00
8 8 2021-05-06 00:08:00
9 9 2021-05-06 00:09:00
0 0 2021-05-07 00:00:00
1 1 2021-05-07 00:01:00
2 2 2021-05-07 00:02:00
3 3 2021-05-07 00:03:00
4 4 2021-05-07 00:04:00
5 5 2021-05-07 00:05:00
6 6 2021-05-07 00:06:00
7 7 2021-05-07 00:07:00
8 8 2021-05-07 00:08:00
9 9 2021-05-07 00:09:00
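
To handle many daily frames at once, the same parsing step can be wrapped in a helper and applied over a mapping of day to frame. The sketch below assumes the frames are held in a dictionary keyed by date string; frames, add_datetime and the label column are hypothetical names, and the sample values are made up.

```python
import pandas as pd

# Hypothetical stand-ins for the daily frames; 'timestamp' holds hhmm as an integer.
frames = {
    '2021-05-06': pd.DataFrame({'timestamp': [0, 1, 2359], 'col1': [1.0, 2.0, 3.0]}),
    '2021-05-07': pd.DataFrame({'timestamp': [0, 1, 2359], 'col1': [4.0, 5.0, 6.0]}),
}

def add_datetime(frame, day):
    """Return a copy of frame with a full datetime built from day + hhmm."""
    hhmm = frame['timestamp'].astype(str).str.zfill(4)
    return frame.assign(
        date=pd.to_datetime(day + ' ' + hhmm, format='%Y-%m-%d %H%M'))

# Stack all days under one another with a fresh 0..n-1 index.
df_final = pd.concat([add_datetime(f, day) for day, f in frames.items()],
                     ignore_index=True)

# Optional: render the key in the 'YYYYMMDD HHmm' layout from the question.
df_final['label'] = df_final['date'].dt.strftime('%Y%m%d %H%M')
print(df_final)
```

Keeping the parsed date column as a real datetime, and formatting it only for display, leaves sorting and resampling available later.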

Calculate delta between two columns and two following rows for different group

Are there any vector operations for improving runtime?
I found no other way besides for loops.
Sample DataFrame:
df = pd.DataFrame({'ID': ['1', '1','1','2','2','2'],
'start_date': ['01-Jan', '05-Jan', '08-Jan', '05-Jan','06-Jan', '10-Jan'],
'start_value': [12, 15, 1, 3, 2, 6],
'end_value': [20, 17, 6,19,13.5,9]})
ID start_date start_value end_value
0 1 01-Jan 12 20.0
1 1 05-Jan 15 17.0
2 1 08-Jan 1 6.0
3 2 05-Jan 3 19.0
4 2 06-Jan 2 13.5
5 2 10-Jan 6 9.0
I've tried:
import pandas as pd
df_original # contains data
rows = []  # collect rows here; DataFrame.append was removed in pandas 2.0
for ID in df_original['ID'].unique():
    tmp_frame = df_original.loc[df_original['ID'] == ID]
    tmp_start_value = 0
    for label, row in tmp_frame.iterrows():
        last_delta = tmp_start_value - row['start_value']
        tmp_start_value = row['end_value']
        row['last_delta'] = last_delta
        rows.append(row)
data_frame_diff = pd.DataFrame(rows).reset_index(drop=True)
Expected Result:
df = pd.DataFrame({'ID': ['1', '1','1','2','2','2'],
'start_date': ['01-Jan', '05-Jan', '08-Jan', '05-Jan', '06-Jan',
'10-Jan'],
'last_delta': [0, 5, 16, 0, 17, 7.5]})
ID start_date last_delta
0 1 01-Jan 0.0
1 1 05-Jan 5.0
2 1 08-Jan 16.0
3 2 05-Jan 0.0
4 2 06-Jan 17.0
5 2 10-Jan 7.5
For each user ID, I want to calculate the delta between the end_value of one timestamp and the start_value of the following timestamp.
Is there a way to improve runtime of this code?

Use DataFrame.groupby on ID and shift the column end_value, then use Series.sub to subtract start_value from it. Finally, use Series.fillna and assign this new column s to the dataframe using DataFrame.assign:
s = df.groupby('ID')['end_value'].shift().sub(df['start_value']).fillna(0)
df1 = df[['ID', 'start_date']].assign(last_delta=s)
Result:
print(df1)
ID start_date last_delta
0 1 01-Jan 0.0
1 1 05-Jan 5.0
2 1 08-Jan 16.0
3 2 05-Jan 0.0
4 2 06-Jan 17.0
5 2 10-Jan 7.5
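
For reference, here is a self-contained version of the vectorised approach run against the sample data from the question. It is a sketch that assumes the rows are already in chronological order within each ID, since shift works on row order; prev_end is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({'ID': ['1', '1', '1', '2', '2', '2'],
                   'start_date': ['01-Jan', '05-Jan', '08-Jan',
                                  '05-Jan', '06-Jan', '10-Jan'],
                   'start_value': [12, 15, 1, 3, 2, 6],
                   'end_value': [20, 17, 6, 19, 13.5, 9]})

# Previous row's end_value within the same ID (NaN on each ID's first row).
prev_end = df.groupby('ID')['end_value'].shift()

# Delta to the current start_value; first rows have no predecessor, so use 0.
df['last_delta'] = (prev_end - df['start_value']).fillna(0)
print(df[['ID', 'start_date', 'last_delta']])
```

If the rows might not be ordered, sorting by ID and a parsed date column before the groupby would be needed first.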

It's a bit difficult to follow from your description what you need, but you might find this helpful:
import pandas as pd
df = (pd.DataFrame({'t1': pd.date_range(start="2020-01-01", end="2020-01-02", freq="H"),
})
.reset_index().rename(columns={'index': 'ID'})
)
df['t2'] = df['t1']+pd.Timedelta(hours=10)
df['delta_t1_t2'] = df['t2']-df['t1']
df['delta_to_previous_t1'] = df['t1'] - df['t1'].shift()
print(df)
It results in
ID t1 t2 delta_t1_t2 delta_to_previous_t1
0 0 2020-01-01 00:00:00 2020-01-01 10:00:00 10:00:00 NaT
1 1 2020-01-01 01:00:00 2020-01-01 11:00:00 10:00:00 01:00:00
2 2 2020-01-01 02:00:00 2020-01-01 12:00:00 10:00:00 01:00:00
3 3 2020-01-01 03:00:00 2020-01-01 13:00:00 10:00:00 01:00:00

pandas group by date, assign value to a column

I have a DataFrame with columns = ['date','id','value'], where id represents different products. Assume that we have n products. I am looking to create a new dataframe with columns = ['date', 'valueid1' ..,'valueidn'], where the values are assigned to the corresponding date-row if they exist, a NaN is assigned as value if they don't. Many thanks

Assuming you have the following DF:
In [120]: df
Out[120]:
date id value
0 2001-01-01 1 10
1 2001-01-01 2 11
2 2001-01-01 3 12
3 2001-01-02 3 20
4 2001-01-03 1 20
5 2001-01-04 2 30
you can use pivot_table() method:
In [121]: df.pivot_table(index='date', columns='id', values='value')
Out[121]:
id 1 2 3
date
2001-01-01 10.0 11.0 12.0
2001-01-02 NaN NaN 20.0
2001-01-03 20.0 NaN NaN
2001-01-04 NaN 30.0 NaN
or
In [122]: df.pivot_table(index='date', columns='id', values='value', fill_value=0)
Out[122]:
id 1 2 3
date
2001-01-01 10 11 12
2001-01-02 0 0 20
2001-01-03 20 0 0
2001-01-04 0 30 0

I think you need pivot:
df = df.pivot(index='date', columns='id', values='value')
Sample:
df = pd.DataFrame({'date':pd.date_range('2017-01-01', periods=5),
'id':[4,5,6,4,5],
'value':[7,8,9,1,2]})
print (df)
date id value
0 2017-01-01 4 7
1 2017-01-02 5 8
2 2017-01-03 6 9
3 2017-01-04 4 1
4 2017-01-05 5 2
df = df.pivot(index='date', columns='id', values='value')
#alternative solution
#df = df.set_index(['date','id'])['value'].unstack()
print (df)
id 4 5 6
date
2017-01-01 7.0 NaN NaN
2017-01-02 NaN 8.0 NaN
2017-01-03 NaN NaN 9.0
2017-01-04 1.0 NaN NaN
2017-01-05 NaN 2.0 NaN
But if you get:
ValueError: Index contains duplicate entries, cannot reshape
you need an aggregating function such as mean or sum, used with groupby or pivot_table:
df = pd.DataFrame({'date':['2017-01-01', '2017-01-02',
'2017-01-03','2017-01-05','2017-01-05'],
'id':[4,5,6,4,4],
'value':[7,8,9,1,2]})
df.date = pd.to_datetime(df.date)
print (df)
date id value
0 2017-01-01 4 7
1 2017-01-02 5 8
2 2017-01-03 6 9
3 2017-01-05 4 1 <- duplicate 2017-01-05 4
4 2017-01-05 4 2 <- duplicate 2017-01-05 4
df = df.groupby(['date', 'id'])['value'].mean().unstack()
#alternative solution (same result as groupby, only slower on a big df)
#df = df.pivot_table(index='date', columns='id', values='value', aggfunc='mean')
print (df)
id 4 5 6
date
2017-01-01 7.0 NaN NaN
2017-01-02 NaN 8.0 NaN
2017-01-03 NaN NaN 9.0
2017-01-05 1.5 NaN NaN <- 1.5 is mean (1 + 2)/2
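
The question asks for column names of the form valueid1 ... valueidn. A minimal sketch of one way to get there after pivoting, using a small made-up frame (the name wide is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['2001-01-01', '2001-01-01', '2001-01-02']),
                   'id': [1, 2, 1],
                   'value': [10, 11, 20]})

wide = (df.pivot(index='date', columns='id', values='value')
          .add_prefix('valueid')       # columns 1, 2 become 'valueid1', 'valueid2'
          .rename_axis(columns=None)   # drop the leftover 'id' label on the columns
          .reset_index())              # turn 'date' back into a regular column
print(wide)
```

Missing date and id combinations stay NaN, as requested in the question.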
