Identify Updated Value in Time Series Data Python Pandas

I don't do a lot of time series work, and I know the way I'm thinking about this solution is sub-optimal. I wanted to get input on the most efficient way to approach this issue.
I have several days of data with multiple values per day, each identified by a timestamp.
Data looks like this:
Index Period Value Timestamp
0 1 73 2017-08-10 16:44:23
1 1 73 2017-08-09 16:30:12
2 1 73 2017-08-08 16:40:31
3 2 50 2017-08-10 16:44:23
4 2 45 2017-08-09 16:30:12
5 2 45 2017-08-08 16:40:31
6 3 13 2017-08-10 16:44:23
7 3 13 2017-08-09 16:30:12
8 3 13 2017-08-08 16:40:31
The example shows one data element for three different periods captured three days in a row. The idea is to determine whether the value for any of the measured periods (Period 1, 2, or 3) changes.
As you can see in the example, on the third day (2017-08-10) the value for Period 2 was updated. I want to detect that changed value.
The only way I can figure out how to compare is to loop through, which I think is inelegant, inefficient, and definitely not Pythonic.
Does anyone have insight into an approach that avoids looping/iteration?
Thanks in advance.
EDIT
Expected Output would be a df as follows if there is a value change in the most recent timestamped data:
Index Period Value Timestamp
0 1 73 2017-08-10 16:44:23
3 2 50 2017-08-10 16:44:23
6 3 13 2017-08-10 16:44:23

First, you can identify rows with a change like this:
df['diff'] = df.groupby('Period')['Value'].diff(-1).fillna(0)
Period Value Timestamp diff
0 1 73 2017-08-10 16:44:23 0.0
1 1 73 2017-08-09 16:30:12 0.0
2 1 73 2017-08-08 16:40:31 0.0
3 2 50 2017-08-10 16:44:23 5.0
4 2 45 2017-08-09 16:30:12 0.0
5 2 45 2017-08-08 16:40:31 0.0
6 3 13 2017-08-10 16:44:23 0.0
7 3 13 2017-08-09 16:30:12 0.0
8 3 13 2017-08-08 16:40:31 0.0
Then select rows to display (all rows with the same timestamp as a row with a change):
lst = df[df['diff'] != 0]['Timestamp'].tolist()
df[df['Timestamp'].isin(lst)]
Period Value Timestamp diff
0 1 73 2017-08-10 16:44:23 0.0
3 2 50 2017-08-10 16:44:23 5.0
6 3 13 2017-08-10 16:44:23 0.0
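For reference, here is a self-contained sketch of the two steps above, with the question's data constructed inline (a restatement of the answer, not new logic):
import pandas as pd

df = pd.DataFrame({
    'Period': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'Value': [73, 73, 73, 50, 45, 45, 13, 13, 13],
    'Timestamp': pd.to_datetime(['2017-08-10 16:44:23', '2017-08-09 16:30:12',
                                 '2017-08-08 16:40:31'] * 3),
})

# Rows are newest-first within each period, so diff(-1) compares each
# value against the next (older) observation for the same period.
df['diff'] = df.groupby('Period')['Value'].diff(-1).fillna(0)

# Keep every row sharing a timestamp with a row where a change occurred.
changed_stamps = df.loc[df['diff'] != 0, 'Timestamp']
print(df[df['Timestamp'].isin(changed_stamps)])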

Related

Python Pandas - Time Series Find Index of Previous Row

I am analyzing time series data for one stock to find the highest price for further analysis. Here is the sample dataframe df:
date close high_3days
2021-05-01 20 20
2021-05-02 23 23
2021-05-03 26 26
2021-05-04 24 26
2021-05-05 20 26
2021-05-06 22 26
2021-05-07 20 26
2021-05-08 30 30
2021-05-09 20 30
2021-05-10 20 30
I want to add a new column holding the number of days since the previous 3-day high. My logic is to find the index of the row with the previous high and subtract it from the index of the current row.
Here is the desire output:
date close high_3days days_previous_high
2021-05-01 20 20 0
2021-05-02 23 23 0
2021-05-03 26 26 0
2021-05-04 24 26 1
2021-05-05 20 26 2
2021-05-06 22 26 3
2021-05-07 20 26 4
2021-05-08 30 30 0
2021-05-09 20 30 1
2021-05-10 20 30 2
Could you help me figure out a way to do this? Thanks guys!
Try creating a boolean index with expanding max, then enumerate each group with groupby cumcount:
df['days_previous_high'] = df.groupby(
    df['high_3days'].expanding().max().diff().gt(0).cumsum()
).cumcount()
df:
date close high_3days days_previous_high
0 2021-05-01 20 20 0
1 2021-05-02 23 23 0
2 2021-05-03 26 26 0
3 2021-05-04 24 26 1
4 2021-05-05 20 26 2
5 2021-05-06 22 26 3
6 2021-05-07 20 26 4
7 2021-05-08 30 30 0
8 2021-05-09 20 30 1
9 2021-05-10 20 30 2
Explanation:
expanding max is used to determine the current maximum value at each row.
df['high_3days'].expanding().max()
diff can be used to see where the expanding max increases, i.e. where a new high is set.
df['high_3days'].expanding().max().diff()
groups can be created by taking the cumsum of where the diff is greater than 0:
df['high_3days'].expanding().max().diff().gt(0).cumsum()
expanding_max expanding_max_diff expanding_max_gt_0 expanding_max_gt_0_cs
20.0 NaN False 0
23.0 3.0 True 1
26.0 3.0 True 2
26.0 0.0 False 2
26.0 0.0 False 2
26.0 0.0 False 2
26.0 0.0 False 2
30.0 4.0 True 3
30.0 0.0 False 3
30.0 0.0 False 3
Now that rows are grouped, groupby cumcount can be used to enumerate each group:
df.groupby(df['high_3days'].expanding().max().diff().gt(0).cumsum()).cumcount()
0 0
1 0
2 0
3 1
4 2
5 3
6 4
7 0
8 1
9 2
dtype: int64
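For readability, the same one-liner can be broken into named steps; this is just an equivalent restatement of the code above:
running_max = df['high_3days'].expanding().max()  # highest value seen so far
new_high = running_max.diff().gt(0)               # True where a new high is set
group_id = new_high.cumsum()                      # one group per run since a high
df['days_previous_high'] = df.groupby(group_id).cumcount()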

Pandas Q-cut: Binning Data using an Expanding Window Approach

This question is somewhat similar to a 2018 question I have found on an identical topic.
I am hoping that if I ask it in a simpler way, someone will be able to figure out a simple fix to the issue that I am currently facing:
I have a timeseries dataframe named "df", which is roughly structured as follows:
V_1 V_2 V_3 V_4
1/1/2000 17 77 15 88
1/2/2000 85 78 6 59
1/3/2000 31 9 49 16
1/4/2000 81 55 28 33
1/5/2000 8 82 82 4
1/6/2000 89 87 57 62
1/7/2000 50 60 54 49
1/8/2000 65 84 29 26
1/9/2000 12 57 53 84
1/10/2000 6 27 70 56
1/11/2000 61 6 38 38
1/12/2000 22 8 82 58
1/13/2000 17 86 65 42
1/14/2000 9 27 42 86
1/15/2000 63 78 18 35
1/16/2000 73 13 51 61
1/17/2000 70 64 75 83
If I wanted to use all the columns to produce daily quantiles, I would follow this approach:
quantiles = df.apply(lambda x: pd.qcut(x, 5, duplicates='drop', labels=False), axis=0)
The output looks like this:
V_1 V_2 V_3 V_4
2000-01-01 1 3 0 4
2000-01-02 4 3 0 3
2000-01-03 2 0 2 0
2000-01-04 4 1 0 0
2000-01-05 0 4 4 0
2000-01-06 4 4 3 3
2000-01-07 2 2 3 2
2000-01-08 3 4 1 0
2000-01-09 0 2 2 4
2000-01-10 0 1 4 2
2000-01-11 2 0 1 1
2000-01-12 1 0 4 2
2000-01-13 1 4 3 1
2000-01-14 0 1 1 4
2000-01-15 3 3 0 1
2000-01-16 4 0 2 3
2000-01-17 3 2 4 4
What I want to do:
I would like to produce quantiles of the data in "df" using observations that occurred before and at a specific point in time. I do not want to include observations that occurred after the specific point in time.
For instance:
To calculate the bins for the 2nd of January 2000, I would like to just use observations from the 1st and 2nd of January 2000; and, nothing after the dates;
To calculate the bins for the 3rd of January 2000, I would like to just use observations from the 1st, 2nd and 3rd of January 2000; and, nothing after the dates;
To calculate the bins for the 4th of January 2000, I would like to just use observations from the 1st, 2nd, 3rd and 4th of January 2000; and, nothing after the dates;
To calculate the bins for the 5th of January 2000, I would like to just use observations from the 1st, 2nd, 3rd, 4th and 5th of January 2000; and, nothing after the dates;
Otherwise put, I would like to use this approach to calculate the bins for ALL the datapoints in "df". That is, to calculate bins from the 1st of January 2000 to the 17th of January 2000.
In short, what I want to do is to conduct an expanding window q-cut (if there is any such thing). It helps to avoid "look-ahead" bias when dealing with timeseries data.
This code block below is wrong, but it illustrates exactly what I am trying to accomplish:
quantiles = df.expanding().apply(lambda x: pd.qcut(x, 5, duplicates='drop', labels=False), axis=0)
Does anyone have any ideas on how to do this in a simpler fashion?
I am new, so take this with a grain of salt, but when broken down I believe your question is a duplicate, because it comes down to simple datetime index slicing, answered HERE.
lt_jan_5 = df.loc[:'2000-01-05'].apply(lambda x: pd.qcut(x, 5, duplicates='drop', labels=False), axis=0)
print(lt_jan_5)
V_1 V_2 V_3 V_4
2000-01-01 1 2 1 4
2000-01-02 4 3 0 3
2000-01-03 2 0 3 1
2000-01-04 3 1 2 2
2000-01-05 0 4 4 0
Hope this is helpful
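If you do want bins for every date, the slicing shown above can be wrapped in a loop over expanding windows. Here is a sketch (my own, not from the answer above) that keeps only the most recent row's bins at each step. Note that the earliest windows may be too small for five distinct bin edges, in which case pd.qcut can return degenerate bins or fail, so you may want to start after a warm-up period:
import pandas as pd

def expanding_qcut(df, q=5):
    # For each date, bin using only data up to and including that date,
    # then keep the bins assigned to that date's own row.
    rows = []
    for end in df.index:
        window = df.loc[:end]
        binned = window.apply(
            lambda x: pd.qcut(x, q, duplicates='drop', labels=False), axis=0
        )
        rows.append(binned.iloc[-1])
    return pd.DataFrame(rows, index=df.index)

quantiles = expanding_qcut(df, 5)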

Find average of every column in a dataframe, grouped by column, excluding one value

I have a Dataframe like the one presented below:
CPU Memory Disk Label
0 21 28 29 0
1 46 53 55 1
2 48 45 49 2
3 48 52 50 3
4 51 54 55 4
5 45 50 56 5
6 50 83 44 -1
What I want is to group by label and find the average of each column. So far I have
dataset.groupby('Label')[['CPU', 'Memory', 'Disk']].mean()
which works just fine, and I get the results as follows:
Label CPU Memory Disk
-1 46.441176 53.882353 54.176471
0 48.500000 58.500000 60.750000
1 45.000000 51.000000 60.000000
2 54.000000 49.000000 56.000000
3 55.000000 71.500000 67.500000
4 53.000000 70.000000 71.000000
5 21.333333 30.000000 30.666667
The only thing I haven't yet found is how to exclude everything that is labeled as -1. Is there a way to do that?
You could filter the dataframe before grouping:
# Exclude rows with Label=-1
dataset = dataset.loc[dataset['Label'] != -1]
# Group by on filtered result
dataset.groupby('Label')[['CPU', 'Memory', 'Disk']].mean()
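As a side note, the same result can be written as one chained expression; a sketch, assuming a reasonably recent pandas (where the column selection must be a list):
# Filter out Label == -1, then aggregate the remaining rows.
result = (
    dataset.query('Label != -1')
           .groupby('Label')[['CPU', 'Memory', 'Disk']]
           .mean()
)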

Correcting data and replacing values in pandas

I want to plot this data frame:
Date TradeSize
0 2013-04-17 0.780000
1 2013-04-20 0.034000
2 2013-04-23 21.972500
3 2013-05-28 0.021000
4 2013-06-16 11.000000
5 2013-06-19 0.013000
6 2013-07-01 9.021000
7 2013-07-13 0.150000
8 2013-09-01 6.000000
9 2013-09-04 0.008000
10 2013-09-16 0.082000
11 2013-09-17 0.010000
12 2013-09-21 0.161000
13 2013-09-22 1.000000
14 2013-09-23 1.000000
15 2013-09-24 1.119000
16 2013-09-28 1.000000
17 2013-12-17 3.000000
18 2013-12-18 1.500000
19 2014-01-11 1.170000
20 2014-01-14 0.000100
21 2014-01-25 4.000000
22 2014-01-26 0.060000
23 2014-01-28 2.029900
24 2014-02-22 0.089900
25 2014-03-02 8.000000
26 2014-03-18 0.008000
27 2014-03-31 0.000100
28 2014-04-05 0.052000
29 2014-04-19 0.122000
30 2014-04-20 0.027000
31 2014-04-21 0.000100
32 2014-04-22 0.001100
33 2014-04-27 0.100000
34 2014-04-29 0.039000
35 2014-05-05 3.521000
36 2014-05-07 0.000105
37 2014-05-11 0.000100
38 2014-05-14 0.000100
39 2014-06-15 0.000800
40 2014-06-21 0.000500
41 2014-06-24 0.000600
42 2014-06-28 0.000400
43 2014-07-14 0.135000
44 2014-07-15 0.002300
45 2014-07-21 300.000000
46 2014-07-22 10.000000
47 2014-08-09 2.000000
48 2014-08-23 19.000000
49 2014-09-13 2.000000
But there is a restriction I should apply to the data to prettify the plot:
if the next row's TradeSize value is not within ±10% of today's TradeSize value, it should be replaced by the average of today's TradeSize value and the next row's TradeSize value. To clarify, see this example:
Date TradeSize
1 2013-04-20 0.034000
2 2013-04-23 21.972500
The value at index 2 is greater than +10% of the value at index 1, so the value at index 2 should be replaced by the average of these two values, and so on.
If the value falls below -10%, the same replacement should apply.
If I understand right, 'tomorrow' means the next row?
Calculate the ±10% bounds first:
min_v = (df['TradeSize'] * 0.9).shift()  # shift down so each row sees the previous row's bound
max_v = (df['TradeSize'] * 1.1).shift()
df = df.assign(min_v=min_v, max_v=max_v)
Then get the average:
df = df.assign(avg=(df['TradeSize'] + df['TradeSize'].shift()) / 2.)
Make a copied result column (for the plot):
df = df.assign(res=df['TradeSize'].copy())
Find the values outside ±10% and replace them with the average:
not_in_range_bool = (df['TradeSize'] < df['min_v']) | (df['TradeSize'] > df['max_v'])
not_in_range_bool.iloc[0] = False  # the first row has no previous value, so set it to False
df.loc[not_in_range_bool, 'res'] = df.loc[not_in_range_bool, 'avg']
Now you can use df['res'] to prettify the plot.
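A compact restatement of the same logic using Series.where (keep values where the condition holds, substitute the average elsewhere). Note that, as above, each replacement is computed from the original neighbouring values, so a change does not cascade into the next comparison:
prev = df['TradeSize'].shift()

# In range means within ±10% of the previous row's value.
in_range = df['TradeSize'].between(prev * 0.9, prev * 1.1)
in_range.iloc[0] = True  # first row has no previous value; keep it as-is

# Keep in-range values; replace the rest with the pairwise average.
df['res'] = df['TradeSize'].where(in_range, (df['TradeSize'] + prev) / 2)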

Python: How to split hourly values into 15 minute buckets?

happy new year to all!
I guess this question might be an easy one, but I can't figure it out.
How can I quickly turn hourly data into 15-minute buckets in Python (see table below)? Basically, the left column should be converted into the right one: just duplicate each hourly value four times and dump it into a new column.
Thanks for the support!
Cheers!
Hourly 15mins
1 28.90 1 28.90
2 28.88 1 28.90
3 28.68 1 28.90
4 28.67 1 28.90
5 28.52 2 28.88
6 28.79 2 28.88
7 31.33 2 28.88
8 32.60 2 28.88
9 42.00 3 28.68
10 44.00 3 28.68
11 44.00 3 28.68
12 44.00 3 28.68
13 39.94 4 28.67
14 39.90 4 28.67
15 38.09 4 28.67
16 39.94 4 28.67
17 44.94 5 28.52
18 66.01 5 28.52
19 49.45 5 28.52
20 48.37 5 28.52
21 38.02 6 28.79
22 34.55 6 28.79
23 33.33 6 28.79
24 32.05 6 28.79
7 31.33
7 31.33
7 31.33
7 31.33
You could also do this by constructing a new DataFrame and using numpy methods.
import numpy as np
pd.DataFrame(np.column_stack((np.arange(df.shape[0]).repeat(4, axis=0),
                              np.array(df).repeat(4, axis=0))),
             columns=['hours', '15_minutes'])
which returns
hours 15_minutes
0 0 28.90
1 0 28.90
2 0 28.90
3 0 28.90
4 1 28.88
5 1 28.88
...
91 22 33.33
92 23 32.05
93 23 32.05
94 23 32.05
95 23 32.05
column_stack stacks the arrays as columns (along axis=1). np.arange(df.shape[0]).repeat(4, axis=0) gets the hour IDs by repeating each of 0 through 23 four times, and the values for each 15-minute bucket are constructed in a similar manner. pd.DataFrame produces the DataFrame, and the column names are added as well.
Create a datetime-like index for your DataFrame; then you can use resample with a forward fill:
series.resample('15T').ffill()
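For completeness, a minimal sketch of the resample route with a constructed hourly index (the question's data has no timestamps, so the dates here are made up). Note that upsampling stops at the last timestamp, so the final hour contributes only one 15-minute point unless you extend the index:
import pandas as pd

# Hypothetical hourly series; values are the first few from the question.
idx = pd.date_range('2021-01-01', periods=4, freq='h')
hourly = pd.Series([28.90, 28.88, 28.68, 28.67], index=idx)

# Upsample to 15-minute buckets and forward-fill each hour's value.
print(hourly.resample('15min').ffill())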
