Presented as an example.
Two data sets. One collected over a 1 hour period. One collected over a 20 min period within that hour.
Each data set contains instances of events that can be transformed into single columns of true (-) or false (_), representing whether the event is occurring or not.
DS1.event:
_-__-_--___----_-__--_-__---__
DS2.event:
__--_-__--
I'm looking for a way to automate the correlation (correct me if the terminology is incorrect) of the two data sets and find the offset(s) into DS1 at which DS2 is most likely (top x matches) to have occurred. This will probably end up with some matching percentage that I can then threshold to determine the validity of the match.
Such that
_-__-_--___----_-__--_-__---__
__--_-__--
DS1.start + 34min ~= DS2.start
Additional information:
DS1 was recorded at roughly 1 Hz. DS2 at roughly 30 Hz. This makes it less likely that there will be a 100% clean match.
Alternate methods (to pandas) will be appreciated, but python/pandas are what I have at my disposal.
Sounds like you just want something like a cross correlation?
I would first convert the strings to a numeric representation, so replace your - and _ with 1 and 0.
You can do that using the string's replace method (e.g. signal.replace("-", "1"))
Convert them to a list or a numpy array:
event1 = [int(x) for x in signal1]
event2 = [int(x) for x in signal2]
Then calculate the cross correlation between them:
xcor = np.correlate(event1, event2, "full")
That will give you the cross correlation value at each time lag. You just want to find the largest value, and the time lag at which it happens:
nR = max(xcor)
maxLag = np.argmax(xcor) # I imported numpy as np here
Giving you something like:
Cross correlation value: 5
Lag: 20
It sounds like you're more interested in the lag value here. What the lag tells you is essentially how many time/positional shifts are required to get the maximum cross correlation value (degree of match) between your two signals.
You might want to take a look at the docs for np.correlate and np.convolve to determine the mode (full, same, or valid) you want to use, as that's determined by the length of your data and what you want to happen if your signals are different lengths
In the picture I plot the values from an array of shape (400,8)
I wish to reorganize the points in order to get 8 series of "continuous" points. Let's call them a(t), b(t), ..., h(t), where a(t) is the series with the smallest values and h(t) the series with the largest. They are unknown and I am trying to recover them.
I have some missing values replaced by 0.
When there is a 0, I do not know which series it belongs to. The zeros are always stored at the high indices of the array.
For instance at time t=136 I have only 4 values that are valid. Then array[t,i] > 0 for i <=3 and array[t,i] = 0 for i > 3
How can I cluster the points in a way that gives "continuous" time series, i.e. at time t=136, array[136,0] should go into d, array[136,1] into e, array[136,2] into f, and array[136,3] into g?
I tried AgglomerativeClustering and DBSCAN with scikit-learn with no success.
Data are available at https://drive.google.com/file/d/1DKgx95FAqAIlabq77F9f-5vO-WPj7Puw/view?usp=sharing
My interpretation is that you mean that you have the data in 400 columns and 8 rows. The data values are assigned to the correct columns, but not necessarily to the correct rows. Your figure shows that the 8 signals do not cross each other, so you should be able to simply sort each column individually. But now the missing data is the problem, because the zeros representing missing data will all sort to the bottom rows, forcing the real data into the wrong rows.
I don't know if this is a good answer, but my first hunch is:
Sort each column individually
Begin in a place where there are several adjacent columns with full spans of real data, and work away from that location, first to the left and then to the right, one column at a time
If the current column contains no zeros, it is OK
If it contains zeros, compute local row averages of the immediately adjacent columns, using only non-zero values (the number of columns to average over depends on the density of missing data and the resolution between the signals), then put each valid value in the current column into the row with the closest local row average, and put zeros in the remaining rows
How to code that depends on what you have done so far. If you are using numpy, it would be convenient to first convert the zeros to NaN's, because numpy.nanmean() will ignore the NaN's.
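As a rough sketch of that hunch (hypothetical and untested on the actual file: the (T, 8) orientation, the `window` size, and the greedy nearest-average assignment are all assumptions; it also assumes every incomplete row has complete rows within its window):

```python
import numpy as np

def reconstruct(arr, window=3):
    """arr: shape (T, S); each row holds its valid values at the low
    indices with zeros padded at the high indices. Returns an array of
    the same shape with each value moved to the column of the series it
    most likely belongs to (NaN where data is missing)."""
    T, S = arr.shape
    work = arr.astype(float)
    work[work == 0] = np.nan                  # zeros are missing data
    out = np.full((T, S), np.nan)
    # rows with a full set of values can simply be sorted, because the
    # signals never cross each other
    complete = ~np.isnan(work).any(axis=1)
    out[complete] = np.sort(work[complete], axis=1)
    for t in np.flatnonzero(~complete):
        vals = np.sort(work[t][~np.isnan(work[t])])
        lo, hi = max(0, t - window), min(T, t + window + 1)
        ref = np.nanmean(out[lo:hi], axis=0)  # local average per series
        free = list(range(S))
        for v in vals:                        # greedy nearest-average match
            j = min(free, key=lambda s: abs(ref[s] - v))
            out[t, j] = v
            free.remove(j)
    return out

# toy data: 8 non-crossing signals; 4 middle values survive at t=10
t = np.linspace(0, 2 * np.pi, 40)
true = np.stack([np.sin(t) + 3 * k + 2 for k in range(8)], axis=1)  # (40, 8)
obs = true.copy()
obs[10] = np.concatenate([np.sort(true[10, 2:6]), np.zeros(4)])
res = reconstruct(obs)  # row 10's values land in columns 2..5
```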
I have a really large dataset from beating two laser frequencies and reading out the beat frequency with a freq. counter.
The problem is that I have a lot of outliers in my dataset.
Filtering is not an option since the filtering/removing of outliers kills precious information for the Allan deviation I use to analyze my beat frequency.
The problem with removing the outliers is that I want to compare Allan deviations of three different beat frequencies. If I remove some points, I will have a shorter x-axis than before and my Allan deviation x-axis will scale differently. (The adev basically builds up a new x-axis starting with intervals of my sample rate up to my longest measurement time -> which is my highest beat frequency x-axis value.)
Sorry if this is confusing, I wanted to give as much information as possible.
So anyway, what I did until now is I got my whole Allan deviation to work and removed outliers successfully, chopping my list into intervals and comparing all y-values of each interval to the standard deviation of the interval.
What I want to change now is that instead of removing the outliers I want to replace them with the mean of their previous and next neighbours.
Below you can find my test code for a list with outliers; it seems to have a problem using numpy's where and I don't really understand why.
The error is given as "'numpy.int32' object has no attribute 'where'". Do I have to convert my dataset to a pandas structure?
What the code does is search for values above/below my threshold, replace them with NaN, and then replace the NaN with my mean. I'm not really familiar with NaN replacement, so I would be very grateful for any help.
l = np.array([[0,4],[1,3],[2,25],[3,4],[4,28],[5,4],[6,3],[7,4],[8,4]])
print(*l)
sd = np.std(l[:,1])
print(sd)
for i in l[:,1]:
    if l[i,1] > sd:
        print(l[i,1])
        l[i,1].where(l[i,1].replace(to_replace = l[i,1], value = np.nan),
                     other = (l[i,1].fillna(method='ffill')+l[i,1].fillna(method='bfill'))/2)
So what I want is a list/array with the outliers replaced by the means of their previous/following neighbours.
error message: 'numpy.int32' object has no attribute 'where'
One option is indeed to transform all the work into pandas, just with
import pandas as pd
dataset = pd.DataFrame({'Column1':data[:,0],'Column2':data[:,1]})
That will solve the error, as a pandas DataFrame object has a where method.
However, that is not obligatory, and we can still operate with just numpy.
For example, the easiest way to detect outliers is to check whether values fall outside the range mean ± k·std (a common rule of thumb is k = 3; the code below uses k = 2).
Code example below, using your setup:
import numpy as np
l = np.array([[0,4],[1,3],[2,25],[3,4],[4,28],[5,4],[6,3],[7,4],[8,4]])
std = np.std(l[:,1])
mean=np.mean(l[:,1])
for i in range(len(l[:,1])):
    if (l[i,1] <= mean + 2*std) & (l[i,1] >= mean - 2*std):
        pass
    else:
        if (i != len(l[:,1]) - 1) & (i != 0):
            l[i,1] = (l[i-1,1] + l[i+1,1]) / 2
        else:
            l[i,1] = mean
What we did here: first, check whether the value is an outlier, at the line
if((l[i,1]<=mean+2*std)&(l[i,1]>=mean-2*std)):
pass
Then check that it is not the first or last element:
if (i != len(l[:,1]) - 1) & (i != 0):
If it is the first or last element, just put the mean into the field:
else:
l[i,1]=mean
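If you do want the NaN-based route the question sketches, here is how it could look in pandas (the 1.5·std cutoff is an assumption, picked so that both spikes in this toy list get caught; `mask` plus `ffill`/`bfill` performs the NaN replacement the question describes):

```python
import numpy as np
import pandas as pd

l = np.array([[0,4],[1,3],[2,25],[3,4],[4,28],[5,4],[6,3],[7,4],[8,4]])
s = pd.Series(l[:, 1], dtype=float)

mean, std = s.mean(), s.std(ddof=0)      # ddof=0 matches np.std
outliers = (s - mean).abs() > 1.5 * std  # flags 25 and 28 here
masked = s.mask(outliers)                # outliers -> NaN
# each NaN becomes the mean of the previous and next valid neighbours
cleaned = (masked.ffill() + masked.bfill()) / 2

print(cleaned.tolist())  # [4.0, 3.0, 3.5, 4.0, 4.0, 4.0, 3.0, 4.0, 4.0]
```

Note that an outlier at the very first or last position would stay NaN here (there is no neighbour on one side); you could fall back to the mean for those, as the loop above does.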
I have a time series (array of values) and I would like to find the starting points where a long drop in values begins (at least X consecutive values going down). For example:
Having a list of values
[1,2,3,4,3,4,5,4,3,4,5,4,3,2,1,2,3,2,3,4,3,4,5,6,7,8]
I would like to find a drop of at least 5 consecutive values. So in this case I would find the segment 5,4,3,2,1.
However, in a real scenario, there is noise in the data, so the actual drop includes a lot of little ups and downs.
I could write an algorithm for this. But I was wondering whether there is an existing library or standard signal processing method for this type of analysis.
You can do this pretty easily with pandas (which I know you have). Convert your list to a Series, then use groupby + transform to find runs of consecutively declining values:
v = pd.Series([...])
v[v.groupby(v.diff().gt(0).cumsum()).transform('size').ge(5)]
10 5
11 4
12 3
13 2
14 1
dtype: int64
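A complete runnable version, using the list from the question:

```python
import pandas as pd

v = pd.Series([1,2,3,4,3,4,5,4,3,4,5,4,3,2,1,2,3,2,3,4,3,4,5,6,7,8])

# v.diff().gt(0) is True wherever the value goes up, so its cumulative
# sum gives each non-increasing run its own group id; keep runs >= 5 long
runs = v.groupby(v.diff().gt(0).cumsum()).transform('size')
drop = v[runs.ge(5)]
print(drop)  # indices 10-14, values 5 4 3 2 1
```

For noisy data, smoothing first (e.g. `v.rolling(3).mean()`) before taking the diff is one way to keep small ups and downs from breaking a long decline into separate runs.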
I have a Pandas Series which contains acceleration timeseries data. My goal is to select slices of extreme force given some threshold. I was able to get part way with the following:
extremes = series.where(lambda force: abs(force - RESTING_FORCE) >= THRESHOLD, other=np.nan)
Now extremes contains all values which exceed the threshold and NaN for any that don't, maintaining the original index.
However, a secondary requirement is that nearby peaks should be merged into a single event. Visually, you can picture the three extremes on the left (two high, one low) being joined into one complete segment and the two peaks on the right being joined into another complete segment.
I've read through the entire Series reference but I'm having trouble finding methods to operate on my partial dataset. For example, if I had a method that returned an array of non-NaN index ranges, I would be able to sequentially compare each range and decide whether or not to fill in the space between with values from the original series (nearby) or leave them NaN (too far apart).
Perhaps I need to abandon the intermediate step and approach this from an entirely different angle? I'm new to Python so I'm having trouble getting very far with this. Any tips would be appreciated.
It actually wasn't so simple to come up with a vectorized solution without looping.
You'll probably need to go through the code step by step to see the actual outcome of each method, but here is a short sketch of the idea:
Solution outline
Identify all peaks via simple threshold filter
Get timestamps of peak values into a column and forward fill the gaps in between, in order to compare each valid timestamp with the previous valid timestamp
Do the actual comparison via diff() to get time deltas and apply the time delta comparison
Convert booleans to integers and use a cumulative sum to create signal groups
Group by signals and get min and max timestamp values
Example data
Here is the code with a dummy example:
%matplotlib inline
import pandas as pd
import numpy as np
size = 200
# create some dummy data
ts = pd.date_range(start="2017-10-28", freq="d", periods=size)
values = np.cumsum(np.random.normal(size=size)) + np.sin(np.linspace(0, 100, size))
series = pd.Series(values, index=ts, name="force")
series.plot(figsize=(10, 5))
Solution code
# define thresholds
threshold_value = 6
threshold_time = pd.Timedelta(days=10)
# create data frame because we'll need helper columns
df = series.reset_index()
# get all initial peaks below or above threshold
mask = df["force"].abs().gt(threshold_value)
# create variable to store only timestamps of initial peaks
df.loc[mask, "ts_gap"] = df.loc[mask, "index"]
# create forward fill to enable comparison between current and next peak
df["ts_fill"] = df["ts_gap"].ffill()
# apply time delta comparison to filter only those within given time interval
df["within"] = df["ts_fill"].diff() < threshold_time
# convert boolean values into integers and
# create cumulative sum which creates groups of consecutive timestamps
df["signals"] = (~df["within"]).astype(int).cumsum()
# create dataframe containing start and end values
df_signal = df.dropna(subset=["ts_gap"])\
              .groupby("signals")["ts_gap"]\
              .agg(["min", "max"])
# show results
df_signal
>>> min max
signals
10 2017-11-06 2017-11-27
11 2017-12-13 2018-01-22
12 2018-02-03 2018-02-23
Finally, show the plot:
series.plot(figsize=(10, 5))
for _, (idx_min, idx_max) in df_signal.iterrows():
series[idx_min:idx_max].plot()
Result
As you can see in the plot, peaks greater than an absolute value of 6 are merged into a single signal if their last and first timestamps are within a range of 10 days. The thresholds here are arbitrary, just for illustration purposes; you can change them to whatever suits your data.
I will be shocked if there isn't some standard library function for this especially in numpy or scipy but no amount of Googling is providing a decent answer.
I am getting data from the Poloniex exchange - cryptocurrency. Think of it like getting stock prices - buy and sell orders - pushed to your computer. So what I have is timeseries of prices for any given market. One market might get an update 10 times a day while another gets updated 10 times a minute - it all depends on how many people are buying and selling on the market.
So my timeseries data will end up being something like:
[1 0.0003234,
1.01 0.0003233,
10.0004 0.00033,
124.23 0.0003334,
...]
Where the 1st column is the time value (I use Unix timestamps to the microsecond but didn't think that was necessary in the example). The 2nd column would be one of the prices - either the buy or the sell price.
What I want is to convert it into a matrix where the data is "sampled" at a regular time frame. So the interpolated (zero-order hold) matrix would be:
[1 0.0003234,
2 0.0003233,
3 0.0003233,
...
10 0.0003233,
11 0.00033,
12 0.00033,
13 0.00033,
...
120 0.00033,
125 0.0003334,
...]
I want to do this with any reasonable time step. Right now I use np.linspace(start_time, end_time, time_step) to create the new time vector.
Writing my own, admittedly crude, zero-order hold interpolator won't be that hard. I'll loop through the original time vector and use np.nonzero to find all the indices in the new time vector which fit between one timestamp (t0) and the next (t1) then fill in those indices with the value from time t0.
For now, the crude method will work. The matrix of prices isn't that big. But I have to think there is a faster method using one of the built-in libraries. I just can't find it.
Also, for the example above I only use a matrix of Nx2 (column 1: times, column 2: price) but ultimately the market has 6 or 8 different parameters that might get updated. A method/library function that could handle multiple prices and such in different columns would be great.
Python 3.5 via Anaconda on Windows 7 (hopefully won't matter).
TIA
For your problem you can use scipy.interpolate.interp1d. It seems to be able to do everything that you want. It is able to do a zero-order hold interpolation if you specify kind="zero". It can also simultaneously interpolate multiple columns of a matrix; you will just have to specify the appropriate axis. f = interp1d(xData, yDataColumns, kind='zero', axis=0) will then return a function that you can evaluate at any point in the interpolation range. You can then get your regularly sampled data by calling f(np.linspace(start_time, end_time, num_samples)) (note that linspace's third argument is the number of samples, not the step size).
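For illustration, a small sketch with the sample values from the question (the 1-unit output grid and the fill behaviour outside the sampled range are assumptions):

```python
import numpy as np
from scipy.interpolate import interp1d

# irregular timestamps and prices from the question
t = np.array([1.0, 1.01, 10.0004, 124.23])
y = np.array([0.0003234, 0.0003233, 0.00033, 0.0003334])

# zero-order hold; hold the first/last value outside the sampled range
f = interp1d(t, y, kind="zero", axis=0,
             bounds_error=False, fill_value=(y[0], y[-1]))

t_new = np.arange(1.0, 126.0)  # regular grid, step of 1 time unit
y_new = f(t_new)
```

With a yDataColumns array of shape (N, k) and axis=0, the same call resamples all k parameter columns in one go, which covers the 6-8 market parameters mentioned above.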