Downsample time series whenever abs(difference) since the previous sample exceeds threshold - python

I have a timeseries of intraday tick-by-tick stock prices that change gradually over time. Whenever there is a small change (e.g. the price increases by $0.01), a new row of data is created. This leads to a very large data series which is slow to plot. I want to downsample so that small changes (e.g. the price goes up/down/up/down/up/down and is unchanged after 50 rows of data) are ignored, which improves plotting speed without sacrificing the qualitative accuracy of the graph. I only want to sample if the price goes up/up/up/up so that I am only displaying obvious changes.
import pandas as pd
import numpy as np
prices = pd.DataFrame(np.random.randint(0,1000, size=(100, 1))/100+1000, columns=list('A'))
I wish to sample whenever the difference with the previous sample exceeds some threshold. So, I will sample row 0 by default. If row 1, 2, 3 and 4 are too close to row 0, I want to throw them away. Then, if row 5 is sufficiently far away from row 0, I will sample that. Then, row 5 becomes my new anchor point, and I will repeat the same process described immediately above.
Is there a way to do this, ideally without a loop?
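For reference, this is the sort of straightforward loop I am hoping to avoid (a sketch; the threshold value is just illustrative):
# Naive anchor-based downsampling with an explicit loop (threshold chosen arbitrarily).
def downsample_loop(series, threshold=5.0):
    keep = [series.index[0]]           # always keep the first row as the initial anchor
    anchor = series.iloc[0]
    for idx, value in series.iloc[1:].items():
        if abs(value - anchor) > threshold:
            keep.append(idx)           # sample this row and make it the new anchor
            anchor = value
    return series.loc[keep]

sampled = downsample_loop(prices['A'], threshold=5.0)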

You could apply a down-sampling masking function that checks whether the cumulative distance has been exceeded, then use that mask to select the applicable rows.
Here is the down-sampling masking function:
def down_mask(x, max_dist=3):
    global cum_diff
    # if NaN (the first diff() value) return True
    if x != x:
        return True
    cum_diff += x
    if abs(cum_diff) > max_dist:
        cum_diff = 0
        return True
    return False
Then apply it and use it as a mask to get the entries that you want:
cum_diff = 0
df = prices.rename(columns={'A': 'prices'})  # the question's frame, with its column renamed to match the output below
df[df['prices'].diff().apply(down_mask, max_dist=5)]
prices
0 1002.07
1 1007.37
2 1000.09
6 1008.08
10 1001.57
14 1006.74
18 1000.42
19 1006.98
21 1001.30
26 1008.89
28 1003.77
38 1009.04
40 1000.52
44 1007.06
47 1001.21
48 1009.38
49 1001.81
51 1008.64
52 1002.72
55 1008.84
56 1000.86
57 1007.17
67 1001.31
68 1006.33
79 1001.14
98 1009.74
99 1000.53
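The same masking idea also works without the module-level global by closing over the running sum; a sketch with the same semantics as down_mask above (make_down_mask is just an illustrative name):
# Closure-based variant of down_mask: the cumulative difference lives in local state.
def make_down_mask(max_dist=5):
    state = {"cum_diff": 0.0}
    def down_mask(x):
        if x != x:                      # NaN from .diff() on the first row -> always keep
            return True
        state["cum_diff"] += x
        if abs(state["cum_diff"]) > max_dist:
            state["cum_diff"] = 0.0
            return True
        return False
    return down_mask

df[df['prices'].diff().apply(make_down_mask(max_dist=5))]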

Not exactly what was asked for. I offer two options: one using only a threshold, and one using a threshold together with a sliding period.
import pandas as pd
import numpy as np
prices = pd.DataFrame(np.random.randint(0,1000, size=(100, 1))/100+1000, columns=list('A'))
threshold_ = 3
index = np.abs(prices['A'].values[1:] - prices['A'].values[:-1]) > threshold_
index = np.insert(index, 0, True)
print(prices[index == True], len(prices[index == True]))
period = 5
hist = len(prices)
index = np.abs(prices['A'].values[period:] - prices['A'].values[:hist-period]) > threshold_
index = np.insert(index, 0, np.zeros(period, dtype=bool))  # pad the first period rows with False (np.empty would give arbitrary booleans)
print(prices[index == True], len(prices[index == True]))
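For what it's worth, the first option can also be written with .diff() directly; a quick sketch that should produce the same mask:
# Equivalent diff-based mask for the first (threshold-only) option.
mask = prices['A'].diff().abs().gt(threshold_)
mask.iloc[0] = True                     # always keep the first row
print(prices[mask], mask.sum())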

Related

How to identify the channels that increased by more than 10% compared with last week's data

I have a large data frame across different timestamps. Here is my attempt:
all_data = []
for ws in wb.worksheets():
    rows = ws.get_all_values()
    df_all_data = pd.DataFrame.from_records(rows[1:], columns=rows[0])
    all_data.append(df_all_data)
data = pd.concat(all_data)
#Change data type
data['Year'] = pd.DatetimeIndex(data['Week']).year
data['Month'] = pd.DatetimeIndex(data['Week']).month
data['Week'] = pd.to_datetime(data['Week']).dt.date
data['Application'] = data['Application'].astype('str')
data['Function'] = data['Function'].astype('str')
data['Service'] = data['Service'].astype('str')
data['Channel'] = data['Channel'].astype('str')
data['Times of alarms'] = data['Times of alarms'].astype('int')
#Compare Channel values over weeks
subchannel_df = data.pivot_table('Times of alarms', index = 'Week', columns='Channel', aggfunc='sum').fillna(0)
subchannel_df = subchannel_df.sort_index(axis=1)
The data frame I am working on
What I hope to achieve:
add a percentage row (the last row vs the second-to-last row) at the end of the data frame, excluding cases such as division by zero and negative percentages
show the channels that increased by more than 10% compared with last week.
I have been trying different methods to achieve this for days, but have not managed to do it. Thank you in advance.
You could use the shift function as an equivalent to the LAG window function in SQL to return last week's value, and then perform the calculations at row level. To avoid dividing by zero you can use the numpy where function, which is equivalent to CASE WHEN in SQL. Let's say the column on which you perform the calculations is named "X":
subchannel_df["XLag"] = subchannel_df["X"].shift(periods=1).fillna(0).astype('int')
subchannel_df["ChangePercentage"] = np.where(subchannel_df["XLag"] == 0, 0, (subchannel_df["X"]-subchannel_df["XLag"])/subchannel_df["XLag"])
subchannel_df["ChangePercentage"] = (subchannel_df["ChangePercentage"]*100).round().astype("int")
subchannel_df[subchannel_df["ChangePercentage"]>10]
Output:
Channel X XLag ChangePercentage
Week
2020-06-12 12 5 140
2020-11-15 15 10 50
2020-11-22 20 15 33
2020-12-13 27 16 69
2020-12-20 100 27 270
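As an aside, pct_change collapses the shift-and-divide step into one call; a sketch using the same assumed column "X", where the divide-by-zero cases become inf and simply drop out of the filter:
# pct_change is shift-and-divide in one step; inf values (zero lag) are excluded by the comparison.
pct = subchannel_df["X"].pct_change().replace([np.inf, -np.inf], np.nan) * 100
subchannel_df[pct > 10]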

Is there a faster way to split a pandas dataframe into two complementary parts?

Good evening all,
I have a situation where I need to split a dataframe into two complementary parts based on the value of one feature.
What I mean by this is that for every row in dataframe 1, I need a complementary row in dataframe 2 that takes on the opposite value of that specific feature.
In my source dataframe, the feature I'm referring to is stored under column "773", and it can take on values of either 0.0 or 1.0.
I came up with the following code that does this sufficiently, but it is remarkably slow. It takes about a minute to split 10,000 rows, even on my all-powerful EC2 instance.
data = chunk.iloc[:, 1:776]
listy1 = []
listy2 = []
for i in range(0, len(data)):
    random_row = data.sample(n=1).iloc[0]
    listy1.append(random_row.tolist())
    if random_row["773"] == 0.0:
        x = data[data["773"] == 1.0].sample(n=1).iloc[0]
        listy2.append(x.tolist())
    else:
        x = data[data["773"] == 0.0].sample(n=1).iloc[0]
        listy2.append(x.tolist())
df1 = pd.DataFrame(listy1)
df2 = pd.DataFrame(listy2)
Note: I don't care about duplicate rows, because this data is being used to train a model that compares two objects to tell which one is "better."
Do you have some insight into why this is so slow, or any suggestions as to make this faster?
A key concept in efficient numpy/scipy/pandas coding is using library-shipped vectorized functions whenever possible: try to process multiple rows at once instead of iterating explicitly over rows, i.e. avoid for loops and .iterrows().
The implementation below is a little subtle in terms of indexing, but the vectorized approach itself is straightforward:
Draw the main dataset at once.
The complementary dataset: draw the 0-rows at once, the complementary 1-rows at once, and then put them into the corresponding rows at once.
Code:
import pandas as pd
import numpy as np
from datetime import datetime
np.random.seed(52) # reproducibility
n = 10000
df = pd.DataFrame(
    data={
        "773": [0, 1] * int(n / 2),
        "dummy1": list(range(n)),
        "dummy2": list(range(0, 10 * n, 10))
    }
)
t0 = datetime.now()
print("Program begins...")
# 1. draw the main dataset
draw_idx = np.random.choice(n, n) # repeatable draw
df_main = df.iloc[draw_idx, :].reset_index(drop=True)
# 2. draw the complementary dataset
# (1) count number of 1's and 0's
n_1 = np.count_nonzero(df["773"][draw_idx].values)
n_0 = n - n_1
# (2) split data for drawing
df_0 = df[df["773"] == 0].reset_index(drop=True)
df_1 = df[df["773"] == 1].reset_index(drop=True)
# (3) draw n_1 indexes in df_0 and n_0 indexes in df_1
idx_0 = np.random.choice(len(df_0), n_1)
idx_1 = np.random.choice(len(df_1), n_0)
# (4) broadcast the drawn rows into the complementary dataset
df_comp = df_main.copy()
mask_0 = (df_main["773"] == 0).values
df_comp.iloc[mask_0, :] = df_1.iloc[idx_1, :].values   # df_1 into mask_0
df_comp.iloc[~mask_0, :] = df_0.iloc[idx_0, :].values  # df_0 into ~mask_0
print(f"Program ends in {(datetime.now() - t0).total_seconds():.3f}s...")
Check
print(df_main.head(5))
773 dummy1 dummy2
0 0 28 280
1 1 11 110
2 1 13 130
3 1 23 230
4 0 86 860
print(df_comp.head(5))
773 dummy1 dummy2
0 1 19 190
1 0 74 740
2 0 28 280 <- this row is complementary to df_main
3 0 60 600
4 1 37 370
Efficiency gain: 14.23s -> 0.011s (ca. 1300x)
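A quick sanity check (sketch): by construction, every row of df_comp should carry the opposite "773" value to the row in the same position of df_main.
# Each drawn row and its complement must disagree on the "773" feature.
assert (df_main["773"].values != df_comp["773"].values).all()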

suggestion on how to solve an infinite loop problem (python-pandas)

I have a data frame with 384 rows (and an additional dummy one at the beginning).
Each row has 4 variables I entered manually, 3 fields calculated from those 4 variables,
and 3 more that compare each calculated field to the previous row. Each of these fields can take 1 of two values (basically True/False).
Final goal - I want to arrange the data frame so that each of the 64 possible combinations of the 6 calculated fields (2^6) occurs 6 times (2^6*6=384).
Each iteration builds a frequency table (pivot), and if any combination's count differs from 6 it breaks and re-randomizes the order.
The problem is that there are 384!-12*6! possible combinations, and my computer has been running the following script for over 4 days without finding a solution.
import pandas as pd
from numpy import random

# a function that calculates if a row is congruent or in-congruent
def set_cong(df):
    if (df["left"] > df["right"] and df["left_size"] > df["right_size"]) or \
       (df["left"] < df["right"] and df["left_size"] < df["right_size"]):
        return "Cong"
    else:
        return "InC"

# open file and calculate the basic fields
DF = pd.read_csv("generator.csv")
DF["distance"] = abs(DF.right - DF.left)
DF["CR"] = DF.left > DF.right
DF["Cong"] = DF.apply(set_cong, axis=1)

again = 1
# main loop to try and find an optimal order
while again == 1:
    # work on a copy of DF so the file does not have to be reloaded each iteration
    df = DF.copy()
    again = 0
    df["rand"] = [random.randint(low=1, high=100000) for i in range(df.shape[0])]
    # 3 of the fields are calculated from the previous row, so the first row is a dummy
    # and needs to stay first when sorted
    df.loc[0, "rand"] = 0
    Sorted = df.sort_values(['rand'])
    Sorted["Cong_n1"] = Sorted.Cong.eq(Sorted.Cong.shift())
    Sorted["Side_n1"] = Sorted.CR.eq(Sorted.CR.shift())
    Sorted["Dist_n1"] = Sorted.distance.eq(Sorted.distance.shift())
    # here the dummy is deleted
    Sorted = Sorted.drop(0, axis=0)
    grouped = Sorted.groupby(['distance', 'CR', 'Cong', 'Cong_n1', 'Dist_n1', 'Side_n1'])
    for name, group in grouped:
        if group.shape[0] != 6:
            again = 1
            break
Sorted.to_csv("Edos.csv", sep="\t", index=False)
print("bye")
the data frame looks like this:
left right size_left size_right distance cong CR distance_n1 cong_n1 side_n1
1 6 22 44 5 T F dummy dummy dummy
5 4 44 22 1 T T F T F
2 3 44 22 1 F F T F F

Taking single value from a grouped data frame in Pandas

I am a new Python convert (from Matlab). I am using the pandas groupby function, and I am getting tripped up by a seemingly easy problem. I have written a custom function that I apply to the grouped df that returns 4 different values. Three of the values are working great, but the other value is giving me an error. Here is the original df:
Index,SN,Date,City,State,ID,County,Age,A,B,C
0,32,9/1/16,X,AL,360,BB County,29.0,negative,positive,positive
1,32,9/1/16,X,AL,360,BB County,1.0,negative,negative,negative
2,32,9/1/16,X,AL,360,BB County,10.0,negative,negative,negative
3,32,9/1/16,X,AL,360,BB County,11.0,negative,negative,negative
4,35,9/1/16,X,AR,718,LL County,67.0,negative,negative,negative
5,38,9/1/16,X,AR,728-13,JJ County,3.0,negative,negative,negative
6,38,9/1/16,X,AR,728-13,JJ County,8.0,negative,negative,negative
7,30,9/1/16,X,AR,728-13,JJ County,8.0,negative,negative,negative
8,30,9/1/16,X,AR,728-13,JJ County,14.0,negative,negative,negative
9,30,9/1/16,X,AR,728-13,JJ County,5.0,negative,negative,negative
...
This is the function that transforms the data. Basically, it counts the number of 'positive' values and the total number of observations in the group. I also want it to return the ID value, and this is where the problem is:
def _ct_id_pos(grp):
    return grp['ID'][0], grp[grp.A == 'positive'].shape[0], grp[grp.B == 'positive'].shape[0], grp.shape[0]
I apply the _ct_id_pos function to the data grouped by Date and SN:
FullMx_prime = FullMx.groupby(['Date', 'SN']).apply(_ct_id_pos).reset_index()
So, the method should return something like this:
Date SN ID 0
0 9/1/16 32 360 (360,2,1,4)
1 9/1/16 35 718 (718,0,0,1)
2 9/2/16 38 728 (728,1,0,2)
3 9/3/16 30 728 (728,2,0,3)
But, I keep getting the following error:
...
KeyError: 0
Obviously, it does not like this part of the function: grp['ID'][0] . I just want to take the first value of grp['ID'] because--if there are multiple values--they should all be the same (i.e., I could take the last, it does not matter). I have tried other ways to index, but to no avail.
Change grp['ID'][0] to grp.iloc[0]['ID']
The problem you are having is due to grp['ID'], which selects a column and returns a pandas.Series. That is straightforward enough, and you could reasonably expect that [0] would select the first element. But [0] actually selects based on the index of the Series, and in this case the index comes from the dataframe that was grouped, so 0 is not always going to be a valid label.
Code:
def _ct_id_pos(grp):
    id = grp.iloc[0]['ID']
    a = grp[grp.A == 'positive'].shape[0]
    b = grp[grp.B == 'positive'].shape[0]
    sz = grp.shape[0]
    return id, a, b, sz
Test Code:
from io import StringIO
import pandas as pd

df = pd.read_csv(StringIO(u"""
Index,SN,Date,City,State,ID,County,Age,A,B,C
0,32,9/1/16,X,AL,360,BB County,29.0,negative,positive,positive
1,32,9/1/16,X,AL,360,BB County,1.0,negative,negative,negative
2,32,9/1/16,X,AL,360,BB County,10.0,negative,negative,negative
3,32,9/1/16,X,AL,360,BB County,11.0,negative,negative,negative
4,35,9/1/16,X,AR,718,LL County,67.0,negative,negative,negative
5,38,9/1/16,X,AR,728-13,JJ County,3.0,negative,negative,negative
6,38,9/1/16,X,AR,728-13,JJ County,8.0,negative,negative,negative
7,30,9/1/16,X,AR,728-13,JJ County,8.0,negative,negative,negative
8,30,9/1/16,X,AR,728-13,JJ County,14.0,negative,negative,negative
9,30,9/1/16,X,AR,728-13,JJ County,5.0,negative,negative,negative
"""), header=0, index_col=0)
print(df.groupby(['Date', 'SN']).apply(_ct_id_pos).reset_index())
Results:
Date SN 0
0 9/1/16 30 (728-13, 0, 0, 3)
1 9/1/16 32 (360, 0, 1, 4)
2 9/1/16 35 (718, 0, 0, 1)
3 9/1/16 38 (728-13, 0, 0, 2)
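A further variant, in case it is useful: on pandas 0.25 or later, named aggregation gives separate output columns instead of a tuple column (a sketch against the same df; the output column names are just illustrative):
# Same counts via named aggregation; each output column comes from one (column, function) pair.
out = (df.groupby(['Date', 'SN'])
         .agg(ID=('ID', 'first'),
              A_pos=('A', lambda s: (s == 'positive').sum()),
              B_pos=('B', lambda s: (s == 'positive').sum()),
              n=('A', 'size'))
         .reset_index())
print(out)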

pandas groupby - return min() along with time where min() occurs

My data is organized in multi-index dataframes. I am trying to groupby the "Sweep" index and return both the min (or max) in a specific time range, along with the time at which that peak occurs.
Data looks like:
Time Primary Secondary BL LED
Sweep
Sweep1 0 0.00000 -28173.828125 -0.416565 -0.000305
1 0.00005 -27050.781250 -0.416260 0.000305
2 0.00010 -27490.234375 -0.415955 -0.002441
3 0.00015 -28222.656250 -0.416260 0.000305
4 0.00020 -28759.765625 -0.414429 -0.002136
Getting the min or max is very straightforward.
def find_groupby_peak(voltage_df, start_time, end_time, peak="min"):
    boolean_vr = (voltage_df.Time >= start_time) & (voltage_df.Time <= end_time)
    df_subset = voltage_df[boolean_vr]
    grouped = df_subset.groupby(level="Sweep")
    if peak == "min":
        peak = grouped.Primary.min()
    elif peak == "max":
        peak = grouped.Primary.max()
    return peak
Which gives (partial output):
Sweep
Sweep1 -92333.984375
Sweep10 -86523.437500
Sweep11 -85205.078125
Sweep12 -87109.375000
Sweep13 -77929.687500
But I need the time where those peaks occur as well. I know I could iterate over the output and find where in the original dataset those values occur, but that seems like a rather brute-force way to do it. I could also write a different function to apply to the grouped object that returns both the max and the time where that max occurs (at least in theory - haven't tried this, but I assume it's pretty straightforward).
Other than those two options, is there a simpler way to pass the outputs from grouped.Primary.min() (i.e. the peak values) to return where in Time those values occur?
You could consider using the transform function with groupby. If you had data that look a bit like this:
import pandas as pd
sweep = ["sweep1", "sweep1", "sweep1", "sweep1",
"sweep2", "sweep2", "sweep2", "sweep2",
"sweep3", "sweep3", "sweep3", "sweep3",
"sweep4", "sweep4", "sweep4", "sweep4"]
Time = [0.009845, 0.002186, 0.006001, 0.00265,
0.003832, 0.005627, 0.002625, 0.004159,
0.00388, 0.008107, 0.00813, 0.004813,
0.003205, 0.003225, 0.00413, 0.001202]
Primary = [-2832.013203, -2478.839133, -2100.671551, -2057.188346,
-2605.402055, -2030.195497, -2300.209967, -2504.817095,
-2865.320903, -2456.0049, -2542.132906, -2405.657053,
-2780.140743, -2351.743053, -2232.340363, -2820.27356]
s_count = [ 0, 1, 2, 3,
0, 1, 2, 3,
0, 1, 2, 3,
0, 1, 2, 3]
df = pd.DataFrame({ 'Time' : Time,
'Primary' : Primary}, index = [sweep, s_count])
Then you could write a very simple transform function that will return for each group of data (grouped by the sweep index), the row at which the minimum value of 'Primary' is located. This you would do with simple boolean slicing. That would look like this:
def trans_function(df):
    return df[df.Primary == min(df.Primary)]
Then to use this function simply call it inside the transform method:
df.groupby(level = 0).transform(trans_function)
And that gives me the following output:
Primary Time
sweep1 0 -2832.013203 0.009845
sweep2 0 -2605.402055 0.003832
sweep3 0 -2865.320903 0.003880
sweep4 3 -2820.273560 0.001202
Obviously you could incorporate that into your function acting on some subset of the data, if that is what you require.
As an alternative you could index the group by using the argmin() function. I tried to do this with transform but it was just returning the entire dataframe. I'm not sure why that should be, it does however work with apply:
def trans_function2(df):
    # note: on recent pandas versions use idxmin() here; Series.argmin() now returns a position, not a label
    return df.loc[df['Primary'].argmin()]
df.groupby(level = 0).apply(trans_function2)
That again gives me:
Primary Time
sweep1 -2832.013203 0.009845
sweep2 -2605.402055 0.003832
sweep3 -2865.320903 0.003880
sweep4 -2820.273560 0.001202
I'm not totally sure why this function does not work with transform - perhaps someone will enlighten us.
I do not know if this will work with your multi-index frame, but it is worth a try; working with:
>>> df
tag tick val
z C 2014-09-07 32
y C 2014-09-08 67
x A 2014-09-09 49
w A 2014-09-10 80
v B 2014-09-11 51
u B 2014-09-12 25
t C 2014-09-13 22
s B 2014-09-14 8
r A 2014-09-15 76
q C 2014-09-16 4
find the indexer using idxmax and then use .loc:
>>> i = df.groupby('tag')['val'].idxmax()
>>> df.loc[i]
tag tick val
w A 2014-09-10 80
v B 2014-09-11 51
y C 2014-09-08 67
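Tying the idxmax idea back to the original multi-index frame, something along these lines should return the peak together with its time per sweep (a sketch; find_groupby_peak_with_time is an illustrative name and assumes the Time/Primary columns and Sweep index level from the question):
# Locate the index label of each sweep's extreme, then pull Time and Primary with .loc.
def find_groupby_peak_with_time(voltage_df, start_time, end_time, peak="min"):
    window = voltage_df[(voltage_df.Time >= start_time) & (voltage_df.Time <= end_time)]
    grouped = window.groupby(level="Sweep")["Primary"]
    labels = grouped.idxmin() if peak == "min" else grouped.idxmax()
    return window.loc[list(labels), ["Time", "Primary"]]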
