Extracting the threshold value of some given raw numbers - python

I am trying to determine the conditions of a wireless channel by analysing captured I/Q samples. I have 50,000 data samples and, as shown in the attached figure, there are spikes in the graph when there is activity (e.g. data transmission) on the channel. I am trying to count the number of spikes, i.e. data values higher than a threshold.
I need an accurate estimate of the threshold, and then I can find the channel load. The threshold value in the attached figure is around 0.0025, and it should be noted that it varies over time. So, each time I take 50,000 samples, I first have to find the threshold value using some sort of unsupervised learning.
I tried k-means (in Python's scikit-learn) to cluster the data and find the centroids of the estimated clusters, but it doesn't give me a good estimate of the threshold value (especially when there is no activity on the channel and the channel is idle).
I would like to know whether anyone has prior experience with similar problems.
Captured data

Since the idle noise seems relatively consistent and very different from when data is transmitted, I can think of several simple algorithms which could give you a reasonable threshold in an unsupervised manner.
The most direct method would be to sort the values (perhaps first grouping them into buckets), then find the lowest-valued region into which a large enough proportion (at least ~5%) of the values falls. Take a reasonable margin above that region's highest values (50%?) and you should be good to go.
You'll need to fiddle with the thresholds a bit. I'd collect sample data and tweak the values until I get it working 100% of the time and the values used make sense.
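A minimal sketch of that bucket-and-margin idea, assuming the samples are magnitudes in a 1-D NumPy array; the 5% fraction, 50% margin and bin count are the guesses mentioned above and will need tuning:

```python
import numpy as np

def estimate_threshold(samples, min_fraction=0.05, margin=0.5, n_bins=100):
    """Unsupervised threshold estimate following the bucket idea above.

    Histogram the values, find the lowest-valued bin that holds at least
    min_fraction of all samples (assumed to be the idle-noise region), and
    place the threshold a relative margin above the top edge of that bin.
    The parameter values are guesses and need tuning on real captures.
    """
    values = np.asarray(samples, dtype=float)
    counts, edges = np.histogram(values, bins=n_bins)
    frac = counts / counts.sum()

    # Lowest-valued bin containing a large enough share of the samples.
    dense_bins = np.nonzero(frac >= min_fraction)[0]
    if dense_bins.size == 0:
        raise ValueError("No bin holds min_fraction of the samples; adjust n_bins")
    noise_top = edges[dense_bins[0] + 1]    # upper edge of that bin

    return noise_top * (1.0 + margin)       # e.g. 50% headroom above the noise region

# threshold = estimate_threshold(iq_magnitudes)
# channel_load = (iq_magnitudes > threshold).mean()
```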

Related

Method of averaging periodic timeseries data

I'm in the process of collecting O2 data for work. This data shows periodic behavior. I would like to parse out each repetition so I can compute statistics like the average and the theoretical error. Data Figure
Is there a convenient way to do the following programmatically:
Identify cyclical data?
Pick out the starting and ending indices of each repetition so the repeating cycles can be concatenated, post-processed, etc.?
I had a few ideas, but I mostly lack the Python programming experience.
Brute force: condition the data in Excel beforehand. (I will likely collect similar data in the future, so I would like a more robust method.)
Train a NN to identify the cycle, then output the indices. (Limited training set, and I would have to label it.)
Decompose into trend/seasonal components and apply a Fourier series to the seasonal data, then pick out N cycles (see the sketch after this question).
Heuristically, i.e. identify rate-of-change thresholds and do event detection (difficult due to the secondary hump, please see the data).
Is there a Python program that systematically does this for me? Any help would be greatly appreciated.
Sample Data
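A minimal sketch of idea 3, assuming a fairly regular period and using NumPy only; the crude mean-subtraction detrend and the function name split_into_cycles are illustrative choices, not a tested recipe for the O2 data:

```python
import numpy as np

def split_into_cycles(signal, sample_rate=1.0):
    """Estimate the cycle length from the dominant Fourier component,
    then slice the series into consecutive full cycles."""
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()                               # crude detrend

    # Dominant frequency from the FFT (ignore the DC bin).
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sample_rate)
    dominant = freqs[1:][np.argmax(spectrum[1:])]
    period = int(round(sample_rate / dominant))    # samples per cycle

    # Cut the series into full cycles and drop the ragged tail.
    n_cycles = len(x) // period
    cycles = x[: n_cycles * period].reshape(n_cycles, period)
    return cycles, period

# With the cycles stacked, per-phase statistics follow directly:
# cycles, period = split_into_cycles(o2_values)
# average_cycle = cycles.mean(axis=0)
# cycle_std = cycles.std(axis=0)
```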

How to correlate partial signal with dataset

I have some large datasets of sensor values, each consisting of a single sensor value sampled at a one-minute interval, like a waveform. The total dataset spans a few years.
I wish to (using Python) select an arbitrary segment of sensor data (for instance 600 values, i.e. 10 hours' worth of data) and find all timestamps where roughly the same shape occurs in these datasets.
The matches should be made by shape (relative differences), not by actual values, as there are different sensors used with different biases and environments. Also, I wish to retrieve multiple matches within a single dataset, to further analyse.
I’ve been looking into pandas, but I’m stuck at the moment... any guru here?
I don't know much about the functionalities available in Pandas. I think you first need to decide the typical time span T over which the correlation is supposed to occur. What I would do is split all your time series into (possibly overlapping) segments of duration T using NumPy (see here for instance). This will lead to a long list of segments. I would then compute the correlation between all pairs of segments using e.g. corrcoef. You get a large correlation matrix in which you can spot the pairs of similar segments by applying a threshold on the absolute value of the correlation. You can estimate the correct threshold by applying this algorithm to a data set where you don't expect any correlation, or by randomizing your data.
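A rough sketch of this segment-correlation idea, simplified to compare overlapping windows of the long series against the one query segment from the question rather than all pairs; the window step, the 0.9 threshold and the flat-segment guard are assumptions to be tuned:

```python
import numpy as np

def find_similar_segments(series, query, threshold=0.9, step=None):
    """Slide a window of the query's length over the series and keep the
    start indices whose segment correlates strongly (in absolute value)
    with the query. Correlation is shape-based, so constant offsets and
    scale differences between sensors do not matter."""
    series = np.asarray(series, dtype=float)
    query = np.asarray(query, dtype=float)
    T = len(query)                        # segment duration in samples
    step = step or T // 4                 # overlapping segments

    matches = []
    for start in range(0, len(series) - T + 1, step):
        segment = series[start:start + T]
        if segment.std() == 0:            # flat segment, correlation undefined
            continue
        r = np.corrcoef(segment, query)[0, 1]
        if abs(r) >= threshold:
            matches.append((start, r))
    return matches

# Example: query = sensor_values[10000:10600]   (600 samples = 10 hours)
# hits = find_similar_segments(sensor_values, query, threshold=0.9)
```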

KMeans: Extracting the parameters/rules that fill up the clusters

I have created a 4-cluster k-means customer segmentation in scikit-learn (Python). The idea is that every month, the business gets an overview of the shifts in the size of each customer cluster.
My question is how to make these clusters 'durable'. If I rerun my script with updated data, the 'boundaries' of the clusters may shift slightly, but I want to keep the old clusters (even though they fit the new data slightly worse).
My guess is that there should be a way to extract the parameters that decide which case goes to its respective cluster, but I haven't found the solution yet.
Got the answer in a different topic:
Just record the cluster means. Then when new data comes in, compare it to each mean and put it in the one with the closest mean.
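A minimal sketch of that idea, with random toy data standing in for the (scaled) customer feature matrices: fit once, keep cluster_centers_, and assign later months to the nearest frozen centroid.

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder data standing in for the customer feature matrices.
rng = np.random.default_rng(0)
reference_customers = rng.normal(size=(500, 5))   # month used to define the clusters
new_customers = rng.normal(size=(480, 5))         # a later month

# Fit once and record the centroids; these stay fixed from now on.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(reference_customers)
centers = kmeans.cluster_centers_                  # save these, e.g. np.save(...)

# Later months: assign each customer to the nearest frozen centroid
# instead of re-fitting, so the cluster boundaries never shift.
distances = np.linalg.norm(new_customers[:, None, :] - centers[None, :, :], axis=2)
labels = distances.argmin(axis=1)
sizes = np.bincount(labels, minlength=len(centers))
print(sizes)                                       # customers per cluster this month
```

Reusing the fitted object, kmeans.predict(new_customers) gives the same nearest-centroid assignment as the explicit distance computation.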

Is there a way to cluster transactions (journals) data using python if a transaction is represented by two or more rows?

In accounting, the data set representing the transactions is called a 'general ledger' and takes the following form:
Note that a 'journal', i.e. a transaction, consists of two line items. E.g. transaction (Journal Number) 1 has two lines: the receipt of cash and the income. Companies can also have transactions (journals) consisting of 3 line items or even more.
Will I first need to cleanse the data to only have one line item for each journal? I.e. cleanse the above 8 rows into 4.
Are there any python machine learning algorithms which will allow me to cluster the above data without further manipulation?
The aim of this is to detect anomalies in transactions data. I do not know what anomalies look like so this would need to be unsupervised learning.
Use a Gaussian on each dimension of the data to determine what is an anomaly. The mean and variance are backed out per dimension, and if the probability density of a new datapoint on that dimension falls below a threshold, it is considered an outlier. This creates one Gaussian per dimension. You can use some feature engineering here, rather than just fitting Gaussians on the raw data.
If features don't look Gaussian (plot their histograms), use data transformations like log(x) or sqrt(x) until they look better.
Use anomaly detection if supervised learning is not available, or if you want to find new, previously unseen kinds of anomalies (such as the failure of a power plant, or someone acting suspiciously, rather than classifying whether someone is male/female).
Error analysis: what if p(x), the probability that an example is not an anomaly, is large for all examples? Add another dimension and hope it helps to expose the anomaly. You could create this dimension by combining some of the others.
To fit the Gaussian more closely to the shape of your data, you can make it multivariate. It then takes a mean vector and a covariance matrix, and you can vary these parameters to change its shape. It will also capture feature correlations, if your features are not all independent.
https://stats.stackexchange.com/questions/368618/multivariate-gaussian-distribution
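A minimal sketch of the per-dimension Gaussian approach; the journal features (amount, line count, posting hour) and the quantile-based epsilon are made up for illustration, not taken from the question:

```python
import numpy as np
from scipy.stats import norm

def fit_gaussians(X):
    """Fit an independent Gaussian to each feature (column) of X."""
    return X.mean(axis=0), X.std(axis=0)

def anomaly_score(X, mu, sigma):
    """p(x): product of the per-dimension densities. Low values = likely anomaly."""
    densities = norm.pdf(X, loc=mu, scale=sigma)   # shape (n_samples, n_features)
    return densities.prod(axis=1)

# Placeholder journal features (e.g. amount, line count, hour of posting).
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[100.0, 2.0, 12.0], scale=[20.0, 0.5, 3.0], size=(1000, 3))

mu, sigma = fit_gaussians(X_train)
p = anomaly_score(X_train, mu, sigma)

# Flag anything whose density falls below a small quantile of the training scores;
# epsilon would normally be tuned on a few labelled examples if any are available.
epsilon = np.quantile(p, 0.01)
anomalies = X_train[p < epsilon]
```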

Generate captions for Time Series Data

I am trying to generate captions for time series data based on increasing/decreasing values.
I have a column with values which change gradually over time (time horizon is irrelevant for now). When we visualize a graph, we make comments like increasing/decreasing, steep curve etc.
I am looking at what libraries are available for this. Has any research been done on graph captioning / time-series captioning?
Currently I am juggling with pyts and exploring the quantization options, converting the values into small clusters and analysing the resulting bag of words.
As of now, it simply looks for the point where a value changes direction from increasing to decreasing, halts, generates "increasing" for the previous segment, and moves on. This approach is inefficient in terms of scaling.
Looking for guidance, suggestions, resources.
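For reference, a minimal sketch of the naive direction-change segmentation described above (the approach the question wants to improve on), using NumPy only; it has no tolerance for noise, which is part of why it scales poorly:

```python
import numpy as np

def caption_segments(values):
    """Split the series wherever the sign of the first difference flips and
    label each run as 'increasing', 'decreasing' or 'flat'."""
    values = np.asarray(values, dtype=float)
    signs = np.sign(np.diff(values)).astype(int)
    if signs.size == 0:
        return []

    names = {1: "increasing", -1: "decreasing", 0: "flat"}
    captions = []
    start = 0
    for i in range(1, len(signs)):
        if signs[i] != signs[i - 1]:                  # direction changed
            captions.append((start, i, names[signs[start]]))
            start = i
    captions.append((start, len(values) - 1, names[signs[start]]))
    return captions

# caption_segments([1, 2, 3, 2, 1, 1, 4]) ->
# [(0, 2, 'increasing'), (2, 4, 'decreasing'), (4, 5, 'flat'), (5, 6, 'increasing')]
```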
