Select previous row every hour in pandas - python

I am trying to obtain the closest previous data point every hour in a pandas data frame. For example:
       time  value
0  14:59:58     15
1  15:00:10     20
2  15:57:42     14
3  16:00:30      9
would return
       time  value
0  15:00:00     15
1  16:00:00     14
i.e. rows 0 and 2 of the original data frame. How would I go about doing so? Thanks!

With the following toy dataframe:
import pandas as pd

df = pd.DataFrame(
    {"time": ["14:59:58", "15:00:10", "15:57:42", "16:00:30"], "value": [15, 20, 14, 9]}
)
Here is one way to do it:
# Setup
df["time"] = pd.to_datetime(df["time"], format="%H:%M:%S")
temp_df = pd.DataFrame(df["time"].dt.round("H").drop_duplicates()).assign(value=pd.NA)
# Add round hours to df, find nearest data points and drop previous hours
new_df = (
    pd.concat([df, temp_df])
    .sort_values(by="time")
    .ffill()
    .pipe(lambda df_: df_[~df_["time"].isin(df["time"])])
    .reset_index(drop=True)
)
# Cleanup
new_df["time"] = new_df["time"].dt.time
print(new_df)
# Output
       time value
0  15:00:00    15
1  16:00:00    14
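An alternative sketch, assuming the same df with "time" already converted by pd.to_datetime as above: pd.merge_asof can match each hour mark directly to the closest earlier row.
# Build the hour marks, then asof-merge each one to the last row at or before it
hours = pd.DataFrame({"time": df["time"].dt.round("H").drop_duplicates()})
nearest = pd.merge_asof(hours, df, on="time", direction="backward")
nearest["time"] = nearest["time"].dt.time
print(nearest)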

Related

How to shift a column by 1 year in Python

With the pandas shift function, you can offset values by a number of rows. I'm looking to offset values by a specified time, which is 1 year in this case.
Here is my sample data frame. The value_py column is what I'm trying to produce with a shift function. This is an oversimplified example of my problem. How do I specify a date offset as the shift parameter instead of a number of rows?
import pandas as pd
import numpy as np
test_df = pd.DataFrame({'dt': ['2020-01-01', '2020-08-01', '2021-01-01', '2022-01-01'],
                        'value': [10, 13, 15, 14]})
test_df['dt'] = pd.to_datetime(test_df['dt'])
test_df['value_py'] = [np.nan, np.nan, 10, 15]
I have tried this, but I'm seeing the index get shifted by 1 year rather than the value column:
test_df.set_index('dt')['value'].shift(12, freq='MS')
This should solve your problem:
test_df['new_val'] = test_df['dt'].map(test_df.set_index('dt')['value'].shift(12, freq='MS'))
test_df
dt value value_py new_val
0 2020-01-01 10 NaN NaN
1 2020-08-01 13 NaN NaN
2 2021-01-01 15 10.0 10.0
3 2022-01-01 14 15.0 15.0
Use .map() to map the values at the shifted dates back to the original dates.
Also, you should use 12 as your shift parameter, not -12.
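To see why the .map() is needed, here is a small sketch of the intermediate result: shift(12, freq='MS') moves the index forward twelve month-starts while leaving the values in place, and .map() then looks those shifted dates up for the original dt values.
shifted = test_df.set_index('dt')['value'].shift(12, freq='MS')
print(shifted)
# dt
# 2021-01-01    10
# 2021-08-01    13
# 2022-01-01    15
# 2023-01-01    14
# Name: value, dtype: int64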

I have a datetime column (hh:mm:ss) in a dataframe. I want to pivot the dataframe using aggfunc on that column

I am trying to pivot the dataframe given below. It has a datetime column in (hh:mm:ss) format, and I want to pivot the dataframe using aggfunc on that column.
import pandas as pd
data = {'Type': ['A', 'B', 'C', 'C'],
        'Name': ['ab', 'ef', 'gh', 'ij'],
        'Time': ['02:00:00', '03:02:00', '04:00:30', '01:02:20']}
df = pd.DataFrame(data)
print (df)
pivot = (
    df.pivot_table(index=['Type'], values=['Time'], aggfunc='sum')
)
df:
  Type Name      Time
0    A   ab  02:00:00
1    B   ef  03:02:00
2    C   gh  04:00:30
3    C   ij  01:02:20
pivot:
                  Time
Type
C     04:00:3001:02:20
A             02:00:00
B             03:02:00
I want the C row to be the sum of the two times: 05:02:50.
This looks more like a groupby sum than a pivot_table.
Convert with to_timedelta to get the appropriate dtype for durations (this makes mathematical operations behave as expected).
Then groupby Type and sum Time to get the total duration per Type.
# Convert to TimeDelta (appropriate dtype)
df['Time'] = pd.to_timedelta(df['Time'])
new_df = df.groupby('Type')['Time'].sum().reset_index()
new_df:
Type Time
0 A 0 days 02:00:00
1 B 0 days 03:02:00
2 C 0 days 05:02:50
Optionally, convert back to string:
new_df['Time'] = new_df['Time'].dt.to_pytimedelta().astype(str)
new_df:
Type Time
0 A 2:00:00
1 B 3:02:00
2 C 5:02:50
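A small sketch of why the to_timedelta conversion matters: summing the original strings concatenates them, while summing timedeltas adds the durations.
s = pd.Series(['04:00:30', '01:02:20'])
print(s.sum())                   # '04:00:3001:02:20' (string concatenation)
print(pd.to_timedelta(s).sum())  # Timedelta('0 days 05:02:50')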

Filling Pandas column between timestamp boundaries

Let's consider a dataframe with a column of timestamps and a second column of measured values.
import pandas as pd
data = {'Time': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
        'Value': [13, 54, 68, 94, 31, 68, 46, 46, 31, 53, 54, 85, 75, 42, 69]}
df = pd.DataFrame(data, columns=['Time', 'Value'])
We want to filter the dataframe to keep only the values at specific timings.
start = [2, 9, 14]
end = [5, 12, 15]
In this case, we have 3 timeframes we want to keep; from 2s to 5s, from 9s to 12s, and from 14s to 15s.
I created a column that marks the boundaries of the timeframes we want to keep.
df.loc[df["Time"].isin(start), "Observation"] = 'Start'
df.loc[df["Time"].isin(end), "Observation"] = 'End'
For filtering the rows, I was thinking of filling the cells between Start and End, and remove the empty rows. And this is where I'm stuck.
I had a go with using:
df = df.fillna(method='ffill')
The issue with this approach is that I only need this fill to be applied to start (to populate the inside of the timeframe of observation) but I don't want to fill after "End".
My first idea was to create another set of timestamp that would take the timestamp of the end of a session and add 1 to it:
import pandas as pd
data = {'Time': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
        'Value': [13, 54, 68, 94, 31, 68, 46, 46, 31, 53, 54, 85, 75, 42, 69]}
df = pd.DataFrame(data, columns=['Time', 'Value'])
start = [2, 9, 14]
end = [5, 12, 15]
out = [x+1 for x in end]
df.loc[df["Time"].isin(start), "Observation"] = 'Start'
df.loc[df["Time"].isin(end), "Observation"] = 'End'
df.loc[df["Time"].isin(out), "Observation"] = 'Out'
df = df.fillna(method='ffill')
The issue with this approach is that, for the problem I need to solve, the timestamps are not seconds at regular intervals; they are milliseconds at random intervals, so using this +1 to create the "Out" tag is not reliable. It also feels like I'm overcomplicating something that should be simple: just keeping the observations between the start timestamps and the end timestamps (both included).
Using a filter (filter/select rows of pandas dataframe by timestamp column) could be an option. However, depending on the session I'm looking at, there can be a random number of timeframes of interest. I wanted to use a for loop scanning through the list of start timestamps and the list of end timestamps to dynamically build such a filter, but I didn't manage to get this working.
If anyone knows of a function that does exactly what I need, or that has any tip, that would be great.
Thank you.
How about creating a function that zips your start and end lists and checks whether the element is within a given pair of values:
def catch_df(start, end, element):
    start_end = zip(start, end)
    for i, z in enumerate(start_end):
        if element >= z[0] and element <= z[1]:
            return "df{}".format(i)
and apply that function to values stored in dataframe df:
df['Result'] = df['Time'].apply(lambda x: catch_df(start, end, x))
As a result you receive the following dataframe, which can easily be filtered for None values, etc.:
Time Value Observation Result
0 1 13 NaN None
1 2 54 Start df0
2 3 68 NaN df0
3 4 94 NaN df0
4 5 31 End df0
5 6 68 NaN None
6 7 46 NaN None
7 8 46 NaN None
8 9 31 Start df1
9 10 53 NaN df1
10 11 54 NaN df1
11 12 85 End df1
12 13 75 NaN None
13 14 42 Start df2
14 15 69 End df2
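For reference, a vectorized sketch of the filter described in the question, assuming the same df, start and end lists: build one boolean mask by OR-ing a condition per (start, end) pair, then keep only the rows inside any of the timeframes.
import numpy as np
mask = np.logical_or.reduce([(df['Time'] >= s) & (df['Time'] <= e)
                             for s, e in zip(start, end)])
filtered = df[mask]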

pandas: Conditionally Aggregate Consecutive Rows

I have a dataframe with a consecutive index (date for every calendar day) and a reference vector that does not contain every date (only working days).
I want to reindex the dataframe to only the dates in the reference vector with the missing data being aggregated to the latest entry before a missing-date-section (i.e. weekend data shall be aggregated together to the last Friday).
Currently I have implemented this by looping over the reversed index and collecting the weekend data, then adding it later in the loop. I'm asking if there is a more efficient "array-way" to do it.
import pandas as pd
import numpy as np
df = pd.DataFrame({'x': np.arange(10), 'y': np.arange(10)**2},
                  index=pd.date_range(start="2018-01-01", periods=10))
print(df)
ref_dates = pd.date_range(start="2018-01-01", periods=10)
ref_dates = ref_dates[:5].append(ref_dates[7:]) # omit 2018-01-06 and -07
# inefficient approach by reverse-traversing the dates, collecting the data
# and aggregating it together with the first date that's in ref_dates
df.sort_index(ascending=False, inplace=True)
collector = []
for dt in df.index:
    if collector and dt in ref_dates:
        # data from previous iteration was collected -> aggregate it and reset collector
        # first append also the current data
        collector.append(df.loc[dt, :].values)
        collector = np.array(collector)
        # applying aggregation function, here sum as example
        aggregates = np.sum(collector, axis=0)
        # setting the new data
        df.loc[dt, :] = aggregates
        # reset collector
        collector = []
    if dt not in ref_dates:
        collector.append(df.loc[dt, :].values)
df = df.reindex(ref_dates)
print(df)
Gives the output (first: source dataframe, second: target dataframe)
x y
2018-01-01 0 0
2018-01-02 1 1
2018-01-03 2 4
2018-01-04 3 9
2018-01-05 4 16
2018-01-06 5 25
2018-01-07 6 36
2018-01-08 7 49
2018-01-09 8 64
2018-01-10 9 81
x y
2018-01-01 0 0
2018-01-02 1 1
2018-01-03 2 4
2018-01-04 3 9
2018-01-05 15 77 # contains the sum of Jan 5th, 6th and 7th
2018-01-08 7 49
2018-01-09 8 64
2018-01-10 9 81
This still has a list-comprehension loop, but it works.
import pandas as pd
import numpy as np
# Create dataframe which contains all days
df = pd.DataFrame({'x': np.arange(10), 'y': np.arange(10)**2},
                  index=pd.date_range(start="2018-01-01", periods=10))
# Create a list which only contains week-days, or whatever dates you need
ref_dates = [x for x in df.index if x.weekday() < 5]
# Set the index of df to a forward-filled version of the reference days
df.index = pd.Series([x if x in ref_dates else float('nan') for x in df.index]).ffill()
# Group by unique dates and sum
df = df.groupby(level=0).sum()
print(df)
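A more pandas-native sketch of the same idea, assuming df and ref_dates as defined in the question: mark the reference dates, forward-fill them into the gaps, and group on those labels.
labels = df.index.to_series().where(df.index.isin(ref_dates)).ffill()
out = df.groupby(labels).sum()
print(out)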

How to correctly perform all-VS-all row-by-row comparisons between series in two pandas dataframes?

I have two pandas dataframes, df1 and df2. Both contain time series data.
df1
Event Number Timestamp_A
A 1 7:00
A 2 8:00
A 3 9:00
df2
Event Number Timestamp_B
B 1 9:01
B 2 8:01
B 3 7:01
Basically, I want to determine the Event B which is closest to Event A, and assign this correctly.
Therefore, I need to subtract every Timestamp_B in df2 from every Timestamp_A in df1, row by row. This results in a series of values, of which I want to take the minimum and put it in a new column in df1.
Event Number Timestamp_A Closest_Timestamp_B
A 1 7:00 7:01
A 2 8:00 8:01
A 3 9:00 9:01
I am not familiar with row-by-row operations in pandas.
When I am doing:
for index, row in df1.iterrows():
    s = df1.Timestamp_A.values - df2["Timestamp_B"][:]
    Closest_Timestamp_B = s.min()
The result I get is a ValueError:
ValueError: operands could not be broadcast together with shapes(3,) (4,)
How to correctly perform row-by-row comparisons between two pandas dataframes?
There might be a better way to do this but here is one way:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'Event': ['A', 'A', 'A'], 'Number': [1, 2, 3],
                    'Timestamp_A': ['7:00', '8:00', '9:00']})
df2 = pd.DataFrame({'Event': ['B', 'B', 'B'], 'Number': [1, 2, 3],
                    'Timestamp_B': ['7:01', '8:01', '9:01']})
df1['Closest_timestamp_B'] = np.zeros(len(df1.index))
for index, row in df1.iterrows():
    df1.loc[index, 'Closest_timestamp_B'] = df2.Timestamp_B.loc[
        np.argmin(np.abs(pd.to_datetime(df2.Timestamp_B) - pd.to_datetime(row.Timestamp_A)))
    ]
df1
Event Number Timestamp_A Closest_timestamp_B
0 A 1 7:00 7:01
1 A 2 8:00 8:01
2 A 3 9:00 9:01
Your best bet is to use the underlying numpy data structure to create a matrix of Timestamp_A by Timestamp_B. Since you need to compare every event in A to every event in B, this is an O(N^2) calculation, well suited for a matrix.
import pandas as pd
import numpy as np
df1 = pd.DataFrame([['A', 1, '7:00'],
                    ['A', 2, '8:00'],
                    ['A', 3, '9:00']], columns=['Event', 'Number', 'Timestamp_A'])
df2 = pd.DataFrame([['B', 1, '9:01'],
                    ['B', 2, '8:01'],
                    ['B', 3, '7:01']], columns=['Event', 'Number', 'Timestamp_B'])
df1.Timestamp_A = pd.to_datetime(df1.Timestamp_A)
df2.Timestamp_B = pd.to_datetime(df2.Timestamp_B)
# create a matrix with the index of df1 as the row index, and the index
# of df2 as the column index
M = df1.Timestamp_A.values.reshape((len(df1),1)) - df2.Timestamp_B.values
# use argmin along axis=1 to find, for each row of df1, the index of the
# df2 row with the smallest absolute difference
index_of_B = np.abs(M).argmin(axis=1)
df1['Closest_timestamp_B'] = df2.Timestamp_B.values[index_of_B]
df1
# returns:
Event Number Timestamp_A Closest_timestamp_B
0 A 1 2017-07-05 07:00:00 2017-07-05 07:01:00
1 A 2 2017-07-05 08:00:00 2017-07-05 08:01:00
2 A 3 2017-07-05 09:00:00 2017-07-05 09:01:00
If you want to return to the original formatting for the timestamps, you can use:
df1.Timestamp_A = df1.Timestamp_A.dt.strftime('%H:%M').str.replace(r'^0', '', regex=True)
df1.Closest_timestamp_B = df1.Closest_timestamp_B.dt.strftime('%H:%M').str.replace(r'^0', '', regex=True)
df1
# returns:
Event Number Timestamp_A Closest_timestamp_B
0 A 1 7:00 7:01
1 A 2 8:00 8:01
2 A 3 9:00 9:01
What about using merge_asof to get the closest events?
Make sure your data types are correct:
df1.Timestamp_A = df1.Timestamp_A.apply(pd.to_datetime)
df2.Timestamp_B = df2.Timestamp_B.apply(pd.to_datetime)
Sort by the times:
df1.sort_values('Timestamp_A', inplace=True)
df2.sort_values('Timestamp_B', inplace=True)
Now you can merge the two dataframes on the closest time:
df3 = pd.merge_asof(df2, df1,
                    left_on='Timestamp_B',
                    right_on='Timestamp_A',
                    suffixes=('_df2', '_df1'))
# clean up the datetime formats
df3[['Timestamp_A', 'Timestamp_B']] = df3[['Timestamp_A', 'Timestamp_B']] \
    .applymap(lambda ts: ts.time())
#put df1 columns on the right
df3 = df3.iloc[:,::-1]
print(df3)
Timestamp_A Number_df1 Event_df1 Timestamp_B Number_df2 Event_df2
0 07:00:00 1 A 07:01:00 3 B
1 08:00:00 2 A 08:01:00 2 B
2 09:00:00 3 A 09:01:00 1 B
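Note that merge_asof defaults to direction='backward' (match the last Timestamp_A at or before each Timestamp_B), which happens to give the right pairs here; for the truly closest timestamp in either direction, a sketch with direction='nearest':
df3 = pd.merge_asof(df2, df1,
                    left_on='Timestamp_B',
                    right_on='Timestamp_A',
                    direction='nearest',
                    suffixes=('_df2', '_df1'))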
Use apply to compare Timestamp_A on each row with all Timestamp_B and get the index of the row with min diff, then extract Timestamp_B using the index.
df1['Closest_Timestamp_B'] = (
    df1.apply(lambda row: abs(pd.to_datetime(row.Timestamp_A).value -
                              df2.Timestamp_B.apply(lambda t: pd.to_datetime(t).value))
              .idxmin(), axis=1)
       .apply(lambda idx: df2.Timestamp_B.loc[idx])
)
df1
Out[271]:
Event Number Timestamp_A Closest_Timestamp_B
0 A 1 7:00 7:01
1 A 2 8:00 8:01
2 A 3 9:00 9:01
