Deleting values conditional on large values of another column - python

I have a time series df composed of daily rates in column A and the relative change from one day to the next in column B.
The df looks something like this:
IR Shift
May/24/2019 5.9% -
May/25/2019 6% 1.67%
May/26/2019 5.9% -1.67%
May/27/2019 20.2% 292%
May/28/2019 20.5% 1.4%
May/29/2019 20% -1.6%
May/30/2019 5.1% -292%
May/31/2019 5.1% 0%
I would like to delete all values in column A which occur between large relative shifts, i.e. > +/- 50%.
So the above df should look like this:
IR Shift
May/24/2019 5.9% -
May/25/2019 6% 1.67%
May/26/2019 5.9% -1.67%
May/27/2019 np.nan 292%
May/28/2019 np.nan 1.4%
May/29/2019 np.nan -1.6%
May/30/2019 5.1% -292%
May/31/2019 5.1% 0%
This is where I've got to so far... I would appreciate some help:
for i, j in df1.iterrows():
    if df1['Shift'][i] > .50 :
        x = df1['IR'][i]
    if df1['Shift'][j] < -.50 :
        y = df1['IR'][j]
df1['IR'] = np.where(df1['Shift'].between(x, y), df1['Shift'], np.nan)
Error ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

We can locate the rows between pairs ([1st-2nd), [3rd-4th), ...) of outlier values and then mask the entire DataFrame at once.
Setup
import pandas as pd
import numpy as np
df = pd.read_clipboard()
df = df.apply(lambda x: pd.to_numeric(x.str.replace('%', ''), errors='coerce'))
IR Shift
May/24/2019 5.9 NaN
May/25/2019 6.0 1.67
May/26/2019 5.9 -1.67
May/27/2019 20.2 292.00
May/28/2019 20.5 1.40
May/29/2019 20.0 -1.60
May/30/2019 5.1 -292.00
May/31/2019 5.1 0.00
Code
# Locate the extremal values
s = df.Shift.lt(-50) | df.Shift.gt(50)
# Get the indices between consecutive pairs.
# This doesn't mask 2nd outlier, which matches your output
m = s.cumsum()%2==1
df.loc[m, 'IR'] = np.NaN
# IR Shift
#May/24/2019 5.9 NaN
#May/25/2019 6.0 1.67
#May/26/2019 5.9 -1.67
#May/27/2019 NaN 292.00
#May/28/2019 NaN 1.40
#May/29/2019 NaN -1.60
#May/30/2019 5.1 -292.00
#May/31/2019 5.1 0.00
Here I've added a few more rows to show how this will behave in the case of multiple spikes. IR_modified is how IR will be masked with the above logic.
IR Shift IR_modified
May/24/2019 5.9 NaN 5.9
May/25/2019 6.0 1.67 6.0
May/26/2019 5.9 -1.67 5.9
May/27/2019 20.2 292.00 NaN
May/28/2019 20.5 1.40 NaN
May/29/2019 20.0 -1.60 NaN
May/30/2019 5.1 -292.00 5.1
May/31/2019 5.1 0.00 5.1
June/1/2019 7.0 415.00 NaN
June/2/2019 17.0 15.00 NaN
June/3/2019 27.0 12.00 NaN
June/4/2019 17.0 315.00 17.0
June/5/2019 7.0 -12.00 7.0
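To reproduce the IR_modified column on the extended data, here is a minimal, self-contained sketch. ext is a hypothetical frame holding just the IR and Shift values from the table above (dates omitted for brevity), and the masking mirrors the .loc assignment shown earlier:
import numpy as np
import pandas as pd

ext = pd.DataFrame({
    'IR':    [5.9, 6.0, 5.9, 20.2, 20.5, 20.0, 5.1, 5.1, 7.0, 17.0, 27.0, 17.0, 7.0],
    'Shift': [np.nan, 1.67, -1.67, 292.0, 1.4, -1.6, -292.0, 0.0,
              415.0, 15.0, 12.0, 315.0, -12.0],
})

s = ext.Shift.lt(-50) | ext.Shift.gt(50)   # spikes at rows 3, 6, 8, 11
m = s.cumsum() % 2 == 1                    # odd periods -> rows 3-5 and 8-10
ext['IR_modified'] = ext['IR'].mask(m)     # same values as the table above
print(ext)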

You can also use the np.where function from numpy as follows:
import pandas as pd
import numpy as np
from datetime import datetime
df = pd.DataFrame({'Date': [datetime(2019,5,24), datetime(2019,5,25), datetime(2019,5,26), datetime(2019,5,27),
                            datetime(2019,5,28), datetime(2019,5,29), datetime(2019,5,30)],
                   'IR': [0.059, 0.06, 0.059, 0.202, 0.205, 0.2, 0.051],
                   'Shift': [np.nan, 0.0167, -0.0167, 2.92, 0.014, -0.016, -2.92]})
df['IR'] = np.where(df['Shift'].between(df['Shift']*0.5, df['Shift']*1.5), df['Shift'], np.nan)
In [8]: df
Out[8]:
Date IR Shift
0 2019-05-24 NaN NaN
1 2019-05-25 0.0167 0.0167
2 2019-05-26 NaN -0.0167
3 2019-05-27 2.9200 2.9200
4 2019-05-28 0.0140 0.0140
5 2019-05-29 NaN -0.0160
6 2019-05-30 NaN -2.9200

Here's an attempt. There could be more "proper" ways to do it but I'm not familiar with all the pandas built-in functions.
import pandas as pd
import numpy as np
from datetime import datetime
df = pd.DataFrame({'Date': [datetime(2019,5,24), datetime(2019,5,25), datetime(2019,5,26), datetime(2019,5,27),
                            datetime(2019,5,28), datetime(2019,5,29), datetime(2019,5,30)],
                   'IR': [0.059, 0.06, 0.059, 0.202, 0.205, 0.2, 0.051],
                   'Shift': [np.nan, 0.0167, -0.0167, 2.92, 0.014, -0.016, -2.92]})
>>>df
Date IR Shift
0 2019-05-24 0.059 NaN
1 2019-05-25 0.060 0.0167
2 2019-05-26 0.059 -0.0167
3 2019-05-27 0.202 2.9200
4 2019-05-28 0.205 0.0140
5 2019-05-29 0.200 -0.0160
6 2019-05-30 0.051 -2.9200
df['IR'] = [np.nan if abs(y - z) > 0.5 else x
            for x, y, z in zip(df['IR'], df['Shift'], df['Shift'].shift(1))]
>>>df
Date IR Shift
0 2019-05-24 0.059 NaN
1 2019-05-25 0.060 0.0167
2 2019-05-26 0.059 -0.0167
3 2019-05-27 NaN 2.9200
4 2019-05-28 NaN 0.0140
5 2019-05-29 0.200 -0.0160
6 2019-05-30 NaN -2.9200

Using df.at to access a single value for a row/column label pair.
import numpy as np
import pandas as pd
from datetime import datetime
df = pd.DataFrame({'Date': [datetime(2019,5,24), datetime(2019,5,25), datetime(2019,5,26), datetime(2019,5,27),
                            datetime(2019,5,28), datetime(2019,5,29), datetime(2019,5,30), datetime(2019,5,31)],
                   'IR': [5.9, 6, 5.9, 20.2, 20.5, 20, 5.1, 5.1],
                   'Shift': [np.nan, 1.67, -1.67, 292, 1.4, -1.6, -292, 0]})
print("DataFrame Before :")
print(df)
count = 1
while (count < len(df.index)):
    if (abs(df.at[count-1, 'Shift'] - df.at[count, 'Shift']) >= 50):
        df.at[count, 'IR'] = np.nan
    count = count + 1
print("DataFrame After :")
print(df)
Output of program:
DataFrame Before :
Date IR Shift
0 2019-05-24 5.9 NaN
1 2019-05-25 6.0 1.67
2 2019-05-26 5.9 -1.67
3 2019-05-27 20.2 292.00
4 2019-05-28 20.5 1.40
5 2019-05-29 20.0 -1.60
6 2019-05-30 5.1 -292.00
7 2019-05-31 5.1 0.00
DataFrame After :
Date IR Shift
0 2019-05-24 5.9 NaN
1 2019-05-25 6.0 1.67
2 2019-05-26 5.9 -1.67
3 2019-05-27 NaN 292.00
4 2019-05-28 NaN 1.40
5 2019-05-29 20.0 -1.60
6 2019-05-30 NaN -292.00
7 2019-05-31 NaN 0.00
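For reference, the same masking can also be written without the explicit loop. This is a sketch of an equivalent vectorized form, since Shift.diff() reproduces the Shift[count] - Shift[count-1] comparison used above:
# Mask IR wherever the absolute day-over-day change of Shift is at least 50
# (same rows as the while loop: 3, 4, 6 and 7).
df.loc[df['Shift'].diff().abs().ge(50), 'IR'] = np.nan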

As per your description of triggering this on any large shift, positive or negative, you could do this:
import pandas as pd
import numpy as np
from datetime import datetime
df = pd.DataFrame({'Date':[datetime(2019,5,24), datetime(2019,5,25), datetime(2019,5,26), datetime(2019,5,27), datetime(2019,5,28),datetime(2019,5,29),datetime(2019,5,30)], 'IR':[0.059,0.06,0.059,0.202, 0.205, 0.2, 0.051], 'Shift':[np.nan, 0.0167, -0.0167, 2.92, 0.014, -0.016, -2.92]})
df.loc[(abs(df.Shift) > .5).cumsum() % 2 == 1, 'IR'] = np.nan
Date IR Shift
0 2019-05-24 0.059 NaN
1 2019-05-25 0.060 0.0167
2 2019-05-26 0.059 -0.0167
3 2019-05-27 NaN 2.9200
4 2019-05-28 NaN 0.0140
5 2019-05-29 NaN -0.0160
6 2019-05-30 0.051 -2.9200
Steps (illustrated in the sketch below):
abs(df.Shift) > .5: finds shifts above +/- 50%.
.cumsum(): gives a unique number to each period, where the odd-numbered periods are the ones we want to omit.
% 2 == 1: checks which rows have an odd cumsum() value.
Note: this does not work if what you want is to constrain it so that every positive spike must be followed by a negative spike, or vice versa.
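To make these steps concrete, here is a small sketch (using the same df as in this answer; the column names is_spike, period and masked are just illustrative) that prints the intermediate values behind the one-liner:
# Intermediate values behind the one-liner above
steps = pd.DataFrame({'Shift': df['Shift']})
steps['is_spike'] = abs(df.Shift) > .5          # True for moves beyond +/- 50%
steps['period'] = steps['is_spike'].cumsum()    # 0, 0, 0, 1, 1, 1, 2
steps['masked'] = steps['period'] % 2 == 1      # odd periods get IR set to NaN
print(steps)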

I was not sure about your Shift column, so I calculated it again. Does this work for you?
import pandas as pd
import numpy as np
df.drop(columns=['Shift'], inplace=True) ## calculated via method below
df['nextval'] = df['IR'].shift(periods=1)
def shift(current, previous):
    return (current - previous) / previous * 100
indexlist=[] ## to save index that will be set to null
prior=0 ## temporary flag to store value prior to a peak
flag=False
for index, row in df.iterrows():
    if index == 0:  ## to skip first row of data
        continue
    if flag == False and (shift(row[1], row[2])) > 50:  ## to check for start of peak
        prior = row[2]
        indexlist.append(index)
        flag = True
        continue
    if flag == True:  ## checking until when the peak lasts
        if (shift(row[1], prior)) > 50:
            indexlist.append(index)
df.loc[df.index.isin(indexlist), 'IR'] = np.nan  ## replacing with nan
Output on print(df)
date IR nextval
0 May/24/2019 5.9 NaN
1 May/25/2019 6.0 5.9
2 May/26/2019 5.9 6.0
3 May/27/2019 NaN 5.9
4 May/28/2019 NaN 20.2
5 May/29/2019 NaN 20.5
6 May/30/2019 5.1 20.0
7 May/31/2019 5.1 5.1

df.loc[df['Shift']>0.5,'IR'] = np.nan

Related

Averaging values with if else statement of a Pandas DataFrame and creating a new resulting DataFrame

I have a df which looks like this:
A B C
5.1 1.1 7.3
5.0 0.3 7.2
4.9 1.7 7.0
10.2 1.1 7.9
10.3 1.0 7.0
15.4 2.0 7.1
15.1 1.0 7.3
0.0 0.9 7.3
0.0 1.3 7.9
0.0 0.5 7.5
-5.1 1.0 7.3
-10.3 0.8 7.3
-10.1 1.0 7.1
I want to detect the range from column "A" and get the mean and std for all the columns and save the result in a new df.
Expected Output:
mean_A Std_A mean_B Std_B mean_C Std_C
5.0 ... 1.03 ... 7.17 ...
10.25 ... 1.05 ... 7.45 ...
... ... ... ... ... ...
So, I want to get the average from group of data based on column "A".
I am new to Python and SO. I hope I was able to explain my goal.
Groups are defined by where the difference between consecutive values in A is greater than 5; then pass to GroupBy.agg and aggregate mean with std:
df = df.groupby(df.A.diff().abs().gt(5).cumsum()).agg(['mean','std'])
df.columns = df.columns.map(lambda x: f'{x[1]}_{x[0]}')
print (df)
mean_A std_A mean_B std_B mean_C std_C
A
0 5.00 0.100000 1.033333 0.702377 7.166667 0.152753
1 10.25 0.070711 1.050000 0.070711 7.450000 0.636396
2 15.25 0.212132 1.500000 0.707107 7.200000 0.141421
3 0.00 0.000000 0.900000 0.400000 7.566667 0.305505
4 -5.10 NaN 1.000000 NaN 7.300000 NaN
5 -10.20 0.141421 0.900000 0.141421 7.200000 0.141421
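To see how the grouping key is formed, here is a short sketch (run on the original df, before the aggregation above replaces it):
# The key increments whenever A jumps by more than 5 in absolute value,
# so each run of similar A values gets its own group number (0 to 5 here).
key = df.A.diff().abs().gt(5).cumsum()
print(pd.concat([df.A, key.rename('group')], axis=1))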

How to find outliers and invalid count for each row in a pandas dataframe

I have a pandas dataframe that looks like this:
X Y Z
0 9.5 -2.3 4.13
1 17.5 3.3 0.22
2 NaN NaN -5.67
...
I want to add 2 more columns. Is invalid and Is Outlier.
Is Invalid will just keep track of the invalid/NaN values in that given row. So for the 2nd row, Is Invalid will have a value of 2. For rows with valid entries, Is Invalid will display 0.
Is Outlier will just check whether that given row has outlier data. This will just be True/False.
At the moment, this is my code:
dt = np.fromfile(path, dtype='float')
df = pd.DataFrame(dt.reshape(-1, 3), columns=['X', 'Y', 'Z'])
How can I go about adding these features?
x='''Z,Y,X,W,V,U,T
1,2,3,4,5,6,60
17.5,3.3,.22,22.11,-19,44,0
,,-5.67,,,,
'''
import pandas as pd, io, scipy.stats
df = pd.read_csv(io.StringIO(x))
df
Sample input:
Z Y X W V U T
0 1.0 2.0 3.00 4.00 5.0 6.0 60.0
1 17.5 3.3 0.22 22.11 -19.0 44.0 0.0
2 NaN NaN -5.67 NaN NaN NaN NaN
Transformations:
df['is_invalid'] = df.isna().sum(axis=1)
df['is_outlier'] = df.iloc[:,:-1].apply(lambda r: (r < (r.quantile(0.25) - 1.5*scipy.stats.iqr(r))) | ( r > (r.quantile(0.75) + 1.5*scipy.stats.iqr(r))) , axis=1).sum(axis = 1)
df
Final output:
Z Y X W V U T is_invalid is_outlier
0 1.0 2.0 3.00 4.00 5.0 6.0 60.0 0 1
1 17.5 3.3 0.22 22.11 -19.0 44.0 0.0 0 0
2 NaN NaN -5.67 NaN NaN NaN NaN 6 0
Explanation for outlier:
The valid range is from Q1 - 1.5*IQR to Q3 + 1.5*IQR.
Since this needs to be calculated per row, we use apply and pass each row (r). To count outliers, we flip the range, i.e. anything less than Q1 - 1.5*IQR or greater than Q3 + 1.5*IQR is counted.
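As a worked check of these bounds, here is a sketch for the first row (using the same df; the first seven columns are the original numeric values):
# Worked IQR check for row 0
r = df.iloc[0, :7]
q1, q3 = r.quantile(0.25), r.quantile(0.75)      # 2.5 and 5.5
iqr = q3 - q1                                    # 3.0, same as scipy.stats.iqr(r) here
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr    # -2.0 and 10.0
print((r < lower) | (r > upper))                 # only T (60.0) falls outside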

maximum sum of consecutive n-days using pandas

I've seen solutions in different languages (e.g. SQL, Fortran, or C++) which mainly use for loops.
I am hoping that someone can help me solve this task using pandas instead.
If I have a data frame that looks like this.
date pcp sum_count sumcum
7/13/2013 0.1 3.0 48.7
7/14/2013 48.5
7/15/2013 0.1
7/16/2013
8/1/2013 1.5 1.0 1.5
8/2/2013
8/3/2013
8/4/2013 0.1 2.0 3.6
8/5/2013 3.5
9/22/2013 0.3 3.0 26.3
9/23/2013 14.0
9/24/2013 12.0
9/25/2013
9/26/2013
10/1/2014 0.1 11.0
10/2/2014 96.0 135.5
10/3/2014 2.5
10/4/2014 37.0
10/5/2014 9.5
10/6/2014 26.5
10/7/2014 0.5
10/8/2014 25.5
10/9/2014 2.0
10/10/2014 5.5
10/11/2014 5.5
And I was hoping I could do the following:
STEP 1 : create the sum_count column by determining total count of consecutive non-zeros in the 'pcp' column.
STEP 2 : create the sumcum column and calculate the sum of non-consecutive 'pcp'.
STEP 3 : create a pivot table that will look like this:
year max_sum_count
2013 48.7
2014 135.5
BUT!! the max_sum_count is based on the condition when sum_count = 3
I'd appreciate any help! thank you!
UPDATED QUESTION:
I previously emphasized that the sum_count should only return the maximum consecutive 3 pcps, but I mistakenly gave the wrong data frame, so I had to edit it. Sorry.
The sumcum of 135.5 came from 96.0 + 2.5 + 37.0. It is the maximum consecutive 3 pcps within the sum_count 11.
Thank you
Use:
#filtering + rolling by days
N = 3
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
#test NaNs
m = df['pcp'].isna()
#groups by consecutive non NaNs
df['g'] = m.cumsum()[~m]
#extract years
df['year'] = df.index.year
#filter no NaNs rows
df = df[~m].copy()
#keep only groups with at least N rows
df['sum_count1'] = df.groupby(['g','year'])['g'].transform('size')
df = df[df['sum_count1'].ge(N)].copy()
#get rolling sum per groups per N days
df['sumcum1'] = (df.groupby(['g','year'])
                   .rolling(f'{N}D')['pcp']
                   .sum()
                   .reset_index(level=[0, 1], drop=True))
#get only maximal counts non NaN and consecutive datetimes
#add missing years
r = range(df['year'].min(), df['year'].max() + 1)
df1 = df.groupby('year')['sumcum1'].max().reindex(r).reset_index(name='max_sum_count')
print (df1)
year max_sum_count
0 2013 48.7
1 2014 135.5
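As a quick sanity check of the 2014 figure (a sketch reusing the datetime-indexed df from the code above), the three consecutive days add up as expected:
# 96.0 + 2.5 + 37.0 == 135.5, the maximum 3-day rolling sum inside the 2014 run
print(df.loc['2014-10-02':'2014-10-04', 'pcp'].sum())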
First, convert date to a real datetime dtype and create a boolean mask which keeps rows where pcp is not null. Then you can create groups and compute your variables:
Input data:
>>> df
date pcp
0 7/13/2013 0.1
1 7/14/2013 48.5
2 7/15/2013 0.1
3 7/16/2013 NaN
4 8/1/2013 1.5
5 8/2/2013 NaN
6 8/3/2013 NaN
7 8/4/2013 0.1
8 8/5/2013 3.5
9 9/22/2013 0.3
10 9/23/2013 14.0
11 9/24/2013 12.0
12 9/25/2013 NaN
13 9/26/2013 NaN
14 10/1/2014 0.1
15 10/2/2014 96.0
16 10/3/2014 2.5
17 10/4/2014 37.0
18 10/5/2014 9.5
19 10/6/2014 26.5
20 10/7/2014 0.5
21 10/8/2014 25.5
22 10/9/2014 2.0
23 10/10/2014 5.5
24 10/11/2014 5.5
Code:
df['date'] = pd.to_datetime(df['date'])
mask = df['pcp'].notna()
grp = df.loc[mask, 'date'] \
        .ne(df.loc[mask, 'date'].shift().add(pd.Timedelta(days=1))) \
        .cumsum()
df = df.join(df.reset_index()
               .groupby(grp)
               .agg(index=('index', 'first'),
                    sum_count=('pcp', 'size'),
                    sumcum=('pcp', 'sum'))
               .set_index('index'))
pivot = df.groupby(df['date'].dt.year)['sumcum'].max() \
          .rename('max_sum_count').reset_index()
Output results:
>>> df
date pcp sum_count sumcum
0 2013-07-13 0.1 3.0 48.7
1 2013-07-14 48.5 NaN NaN
2 2013-07-15 0.1 NaN NaN
3 2013-07-16 NaN NaN NaN
4 2013-08-01 1.5 1.0 1.5
5 2013-08-02 NaN NaN NaN
6 2013-08-03 NaN NaN NaN
7 2013-08-04 0.1 2.0 3.6
8 2013-08-05 3.5 NaN NaN
9 2013-09-22 0.3 3.0 26.3
10 2013-09-23 14.0 NaN NaN
11 2013-09-24 12.0 NaN NaN
12 2013-09-25 NaN NaN NaN
13 2013-09-26 NaN NaN NaN
14 2014-10-01 0.1 11.0 210.6
15 2014-10-02 96.0 NaN NaN
16 2014-10-03 2.5 NaN NaN
17 2014-10-04 37.0 NaN NaN
18 2014-10-05 9.5 NaN NaN
19 2014-10-06 26.5 NaN NaN
20 2014-10-07 0.5 NaN NaN
21 2014-10-08 25.5 NaN NaN
22 2014-10-09 2.0 NaN NaN
23 2014-10-10 5.5 NaN NaN
24 2014-10-11 5.5 NaN NaN
>>> pivot
date max_sum_count
0 2013 48.7
1 2014 210.6
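To inspect how these groups are built, here is a small sketch (reusing the grp and mask variables from the code above):
# Each non-NaN pcp row gets a group id; a new group starts whenever the date
# is not exactly one day after the previous non-NaN date.
check = df.loc[mask, ['date', 'pcp']].copy()
check['grp'] = grp
print(check)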

Remove higher level index names after pivot [duplicate]

This question already has answers here:
flattern pandas dataframe column levels [duplicate]
(1 answer)
How to flatten a hierarchical index in columns
(19 answers)
Closed 1 year ago.
I have the following dataframe:
import datetime
import random
import pandas as pd

dates = [str(datetime.datetime(2020, 1, 1, 0, 0, 0, 0) + datetime.timedelta(days=i)) for i in range(3)]
repetitions = [3, 6, 4]
dates = [i for i, j in zip(dates, repetitions) for k in range(j)]
cities_ = ['Paris', 'Tokyo', 'Sydney', 'New-York', 'Rio', 'Berlin']
cities = [cities_[0: repetitions[i]] for i in range(len(repetitions))]
cities = [i for j in cities for i in j]
temperatures = [round(random.normalvariate(20, 5), 1) for _ in range(len(cities))]
humidities = [round(random.normalvariate(0.5, 0.4), 1) for _ in range(len(cities))]
humidities = [min(i, 1) for i in humidities]
humidities = [max(i, 0) for i in humidities]
df = pd.DataFrame(data=list(zip(dates, cities, temperatures, humidities)), columns=['date', 'city', 'temperature', 'humidity'])
I need to remove the indexes after applying the pivot function; the code below
values = ['temperature', 'humidity']
df_ = df.pivot(index='date', columns='city', values=values)
Col = list(set(df['city'].values))
for value in values:
    df_.rename(columns={i: value + '_' + i for i in Col}, inplace=True)
outputs:
temperature ... humidity
city temperature_Berlin temperature_New-York temperature_Paris temperature_Rio ... temperature_Paris temperature_Rio temperature_Sydney temperature_Tokyo
date ...
2020-01-01 00:00:00 NaN NaN 21.2 NaN ... 0.3 NaN 1.0 1.0
2020-01-02 00:00:00 18.4 14.2 19.3 28.7 ... 0.6 0.6 0.1 0.2
2020-01-03 00:00:00 NaN 31.6 25.9 NaN ... 0.8 NaN 0.1 0.0
and I need the following result:
temperature_Paris humidity_Paris temperature_Tokyo humidity_Tokyo temperature_Sydney ... humidity_New-York temperature_Rio humidity_Rio temperature_Berlin humidity_Berlin
2020-01-01 00:00:00 21.2 0.3 17.5 1.0 26.3 ... NaN NaN NaN NaN NaN
2020-01-02 00:00:00 19.3 0.6 15.1 0.2 22.8 ... 0.1 28.7 0.6 18.4 0.4
2020-01-03 00:00:00 25.9 0.8 27.5 0.0 29.7 ... 0.6 NaN NaN NaN NaN
The various solutions offered for questions that look similar, like essentially:
df_ = df_.reset_index().rename_axis([None, None], axis=1)
do not work here.
Replace:
Col = list(set(df['city'].values))
for value in values:
    df_.rename(columns={i: value + '_' + i for i in Col}, inplace=True)
With:
df_.columns = ['_'.join(i) for i in df_.columns]
Outputs:
temperature_Berlin temperature_New-York ... humidity_Sydney humidity_Tokyo
date
2020-01-01 00:00:00 NaN NaN ... 0.3 0.6
2020-01-02 00:00:00 23.3 26.3 ... 0.8 0.0
2020-01-03 00:00:00 NaN 14.6 ... 0.2 0.6
Edit:
A probably more elegant alternative, as suggested by @Henry Ecker in the comments:
df_.columns = df_.columns.map('_'.join)
You can use Index.map() with f-string, as follows:
df_.columns = df_.columns.map(lambda x: f'{x[0]}_{x[1]}')
This way, you have the freedom to arrange the sequence of the combined words from the MultiIndex as you wish. E.g. if you want the city name first and then the word 'temperature' (e.g. Berlin_temperature instead), you can just reverse the sequence of x[0] and x[1] in the f-string above (see the sketch after the result below).
Result:
print(df_)
temperature_Berlin temperature_New-York temperature_Paris temperature_Rio temperature_Sydney temperature_Tokyo humidity_Berlin humidity_New-York humidity_Paris humidity_Rio humidity_Sydney humidity_Tokyo
date
2020-01-01 00:00:00 NaN NaN 22.8 NaN 24.7 28.8 NaN NaN 1.0 NaN 0.9 0.0
2020-01-02 00:00:00 20.2 21.5 21.6 21.6 4.3 21.5 0.5 0.5 1.0 0.4 0.4 0.0
2020-01-03 00:00:00 NaN 17.3 24.4 NaN 11.3 22.7 NaN 0.4 0.1 NaN 0.0 0.5
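For example, reversing the order as described above would look like this (a sketch; note that df_ must still carry its original two-level columns when this runs):
# City name first, then the measurement name, e.g. Berlin_temperature
df_.columns = df_.columns.map(lambda x: f'{x[1]}_{x[0]}')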

Find a value in a column as a function of another column

Assuming that the value exists, how can I, for example, create another column "testFinal" in the dataframe holding the absolute value of df["test"] minus the df["test"] value that occurs 0.2 seconds later?
For example, the first value of testFinal is the absolute difference between 2 and the value 0.2 seconds later, which is 8, so the result is abs(2-8) = 6.
My goal is to calculate "testFinal".
I don't know if it's clear, so here is the example.
NB : the Timestamp is not homogeneous, so the interval between two values can be different over time
Thanks a lot
Here is the code for the dataframe
import numpy as np
import pandas as pd

df = pd.DataFrame({'Timestamp': [11.1, 11.2, 11.3, 11.4, 11.5, 11.6, 11.7, 11.8, 11.9, 12.0, 12.10],
                   'test': [2, 22, 8, 4, 5, 4, 5, 3, 54, 23, 89],
                   'testFinal': [6, 18, 3, 0, 0, 1, 49, 20, 35, np.NaN, np.NaN]})
First, create a temporary column temp by converting the Timestamp column to timedelta using pd.to_timedelta, and set this temp column as the dataframe index. Then create a new column testFinal whose values are this new index + 0.2 seconds. Using Series.map, map the testFinal column to the values of df['test']; testFinal now holds the test values from 0.2 s later. Finally, subtract the test column from testFinal and take the absolute value to get the desired result:
df['temp'] = pd.to_timedelta(df['Timestamp'], unit='s')
df = df.set_index('temp')
df['testFinal'] = df.index + pd.Timedelta(seconds=0.2)
df['testFinal'] = df['testFinal'].map(df['test']).sub(df['test']).abs()
df = df.reset_index(drop=True)
# print(df)
Timestamp test testFinal
0 11.1 2 6.0
1 11.2 22 18.0
2 11.3 8 3.0
3 11.4 4 0.0
4 11.5 5 0.0
5 11.6 4 1.0
6 11.7 5 49.0
7 11.8 3 20.0
8 11.9 54 35.0
9 12.0 23 NaN
10 12.1 89 NaN
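To see which labels Series.map looks up here, a short sketch (reusing the final df above); this is also why the approach relies on Timestamp + 0.2 s being present exactly in the data:
# The map looks up these timedelta labels in the temp-indexed 'test' series;
# a key without an exact match would come back as NaN.
lookup = pd.to_timedelta(df['Timestamp'], unit='s') + pd.Timedelta(seconds=0.2)
print(lookup.head(3))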
You could use numpy as follows. I created a new column test_final to compare with the expected testFinal column.
import numpy as np
test = df.test.values
df['test_final'] = np.abs(test - np.concatenate((test[2:], np.array([np.nan]*2)), axis=0))
print(df)
Output:
Timestamp test testFinal test_final
0 11.1 2 6.0 6.0
1 11.2 22 18.0 18.0
2 11.3 8 3.0 3.0
3 11.4 4 0.0 0.0
4 11.5 5 0.0 0.0
5 11.6 4 1.0 1.0
6 11.7 5 49.0 49.0
7 11.8 3 20.0 20.0
8 11.9 54 35.0 35.0
9 12.0 23 NaN NaN
10 12.1 89 NaN NaN
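Equivalently, the same two-rows-ahead difference can be expressed with pandas' shift using a negative period. This is a sketch under the same assumption as the answer above, namely that 0.2 s always corresponds to exactly two rows (which, as the question notes, may not hold for non-homogeneous timestamps):
# Same result as the numpy version: compare each value with the one two rows later;
# the last two rows have no counterpart and become NaN.
df['test_final'] = (df['test'] - df['test'].shift(-2)).abs()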
