I have spent the best part of the last few days searching forums and reading papers trying to answer the following question. I have thousands of time series arrays of varying lengths, each containing a single column vector. This column vector contains the time between clicks for dolphins using echolocation.
I have managed to cluster these into similar groups using DTW and now want to check which trains have a high degree of self-similarity, i.e. repeated patterns. I only want to know each train's similarity with itself and am not interested in comparing it with other trains, as I have already applied DTW for that. I'm hoping some of these clusters will contain trains with a high proportion of repeated patterns.
I have already applied the Ljung–Box test to each series to check for autocorrelation, but I think I should maybe be using something based on the FFT and the power spectrum. I don't have much experience with this but have tried to do so using the Python package waipy. Ultimately, I just want to know whether there is some kind of repeated pattern in the data, ideally tested with a p-value. The image I have attached shows an example train across the top. The maximum length of my trains is 550.
example output from Waipy
I know this is quite a complex question but any help would be greatly appreciated even if it is a link to a helpful Python library.
Thanks,
Dex
For anyone in a similar position: I decided to go with motifs, as they can find a repeated pattern within a time series using Euclidean distance. There is a really good Python package called STUMPY which makes this very easy!
Thanks,
Dex
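For readers who want to see what motif discovery is doing, here is a minimal brute-force sketch in plain NumPy. It is not STUMPY's implementation, which computes the same kind of nearest-neighbour profile far more efficiently. Each window of length m is z-normalized, and for every window the Euclidean distance to its closest non-overlapping window is recorded; the window with the smallest such distance is the best-repeated pattern. The function name, the window length m, the exclusion zone of m // 2, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def naive_matrix_profile(series, m):
    """Brute-force nearest-neighbour distance for every length-m window."""
    x = np.asarray(series, dtype=float)
    n_windows = len(x) - m + 1
    # Stack all sliding windows, then z-normalize each one.
    windows = np.lib.stride_tricks.sliding_window_view(x, m)
    mean = windows.mean(axis=1, keepdims=True)
    std = windows.std(axis=1, keepdims=True)
    std[std == 0] = 1.0  # constant windows: avoid division by zero
    z = (windows - mean) / std
    # Pairwise Euclidean distances between all z-normalized windows.
    sq = (z ** 2).sum(axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * z @ z.T
    dist = np.sqrt(np.maximum(dist2, 0.0))
    # Exclude trivial matches: a window and its close, overlapping neighbours.
    excl = m // 2
    idx = np.arange(n_windows)
    dist[np.abs(idx[:, None] - idx[None, :]) <= excl] = np.inf
    return dist.min(axis=1), dist.argmin(axis=1)

# Synthetic stand-in for one train of inter-click intervals.
rng = np.random.default_rng(0)
ici = rng.normal(size=300)
pattern = np.sin(np.linspace(0, 4 * np.pi, 30))
ici[50:80] = pattern      # plant the same pattern twice
ici[200:230] = pattern

profile, neighbour = naive_matrix_profile(ici, m=30)
best = int(profile.argmin())
print("motif starts at", best, "and repeats at", int(neighbour[best]))
```

A train with many low values in the profile contains many repeated sub-patterns, so a summary such as the fraction of windows below some distance threshold could be compared across clusters. The pairwise distance matrix is quadratic in the train length, which is fine for trains of at most 550 samples but not for long series.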
Related
I'm in the process of collecting O2 data for work. The data shows periodic behavior. I would like to separate out each repetition so that I can compute statistics such as the average and the theoretical error.
Data Figure
Is there a convenient way to programmatically:
Identify cyclical data?
Pick out starting & ending indices so that the repeating cycles can be concatenated, post-processed, etc.?
I have a few ideas, but I lack the Python programming experience to implement them:
Brute force: condition the data in Excel beforehand. (I will likely collect similar data in the future and would like a more robust method.)
Train a NN to identify a cycle and then output the indices. (Limited training set; I would have to label it.)
Decompose into trend and seasonal components, then apply a Fourier series to the seasonal data and pick out N cycles.
Heuristically, i.e. identify thresholds on the rate of change and use event detection (difficult due to the secondary hump, please see the data).
Is there a Python program that systematically does this for me? Any help would be greatly appreciated.
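One possible starting point, sketched here under the assumption that the cycles have a roughly constant length: estimate the period from the autocorrelation, then cut the signal into equal-length cycles and stack them. The function names and the synthetic trace (period 50 samples, with a secondary hump) are made up for illustration; real data with a drifting period would more likely need peak-based segmentation, for example with scipy.signal.find_peaks.

```python
import numpy as np

def estimate_period(signal):
    """Estimate the dominant period (in samples) from the autocorrelation."""
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()
    # Autocorrelation for non-negative lags only.
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]
    # Skip the lag-0 peak: start searching after the first negative value.
    negative = np.flatnonzero(ac < 0)
    if negative.size == 0:
        raise ValueError("no oscillation found in the autocorrelation")
    start = negative[0]
    return int(start + np.argmax(ac[start:]))

def split_cycles(signal, period):
    """Return (start, end) index pairs and the cycles stacked as rows."""
    x = np.asarray(signal, dtype=float)
    starts = np.arange(0, len(x) - period + 1, period)
    bounds = [(int(s), int(s + period)) for s in starts]
    cycles = np.stack([x[s:e] for s, e in bounds])
    return bounds, cycles

# Synthetic stand-in for the O2 trace.
t = np.arange(500)
o2 = np.sin(2 * np.pi * t / 50) + 0.3 * np.sin(4 * np.pi * t / 50)

period = estimate_period(o2)
bounds, cycles = split_cycles(o2, period)
mean_cycle = cycles.mean(axis=0)  # average shape of one cycle
std_cycle = cycles.std(axis=0)    # point-wise spread across cycles
```

Because every cycle becomes one row of `cycles`, the per-sample mean and standard deviation across repetitions fall out directly, and `bounds` gives the start and end indices for any further post-processing.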
Sample Data
I'm using a dataset to predict the effects of COVID-19 on the economy. The dataset contains 9k rows, and around 1k rows in each column are empty. Do I need to fill them manually by looking at other datasets online, can I fill them with the average, or should I leave the dataset as it is?
Generally, I'd say that combining datasets from multiple sources without being really clear about your rationale can raise pretty big questions about the reliability of your data.
Otherwise, assuming averages and leaving nulls are both valid options, depending on what you're trying to do. If you're using scikit-learn, for example, you'll probably find that nulls throw errors, so filling with assumed averages is a relatively common thing to do. Although you might want to watch out, given that you've got more than 10% nulls!
From experience, I'd say think about what you're trying to do, and what will help you get there best. Then be really clear about presenting your methodology with your findings.
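As a minimal sketch of the "fill with averages" option, here is column-mean imputation in plain Python. The function name and the toy data are illustrative assumptions; in scikit-learn, SimpleImputer(strategy="mean") does the same job.

```python
import math
from statistics import mean

def impute_column_means(rows):
    """Replace missing values (None or NaN) with the mean of their column."""
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    n_cols = len(rows[0])
    col_means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if not is_missing(r[j])]
        col_means.append(mean(observed))  # raises if a column is entirely empty
    filled = [[col_means[j] if is_missing(r[j]) else r[j] for j in range(n_cols)]
              for r in rows]
    return filled, col_means

# Toy data: one missing value per column.
data = [[1.0, 10.0], [None, 20.0], [3.0, float("nan")]]
filled, col_means = impute_column_means(data)
print(filled)
```

Returning the column means alongside the filled data makes it easy to report exactly which values were assumed, which helps with the point about presenting your methodology.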
I have created a 4-cluster k-means customer segmentation in scikit-learn (Python). The idea is that every month, the business gets an overview of the shifts in the number of our customers in each cluster.
My question is how to make these clusters 'durable'. If I rerun my script with updated data, the 'boundaries' of the clusters may shift slightly, but I want to keep the old clusters (even though they fit the data slightly worse).
My guess is that there should be a way to extract the parameters that decide which case goes into which cluster, but I haven't found the solution yet.
Got the answer in a different topic:
Just record the cluster means. Then when new data comes in, compare it to each mean and put it in the one with the closest mean.
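A minimal sketch of that idea in plain Python, assuming the centroids from the original fit have been saved somewhere (with scikit-learn they are in the fitted model's cluster_centers_ attribute). The centroid and customer values below are made up, and new data must go through the same scaling that was applied before the original fit.

```python
import math

def assign_to_fixed_clusters(points, centroids):
    """Label each point with the index of its nearest stored centroid."""
    return [min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points]

# Hypothetical centroids saved from the original fit.
saved_centroids = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]
# Hypothetical customers from this month's data.
new_customers = [(1.0, 2.0), (9.0, 1.0), (2.0, 8.5), (11.0, 9.0)]

labels = assign_to_fixed_clusters(new_customers, saved_centroids)
print(labels)
```

Keeping the fitted KMeans object and calling its predict method on the new data gives the same nearest-centroid assignment without refitting.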
I have a huge database (~100 variables with a few million rows) consisting of stock data. I managed to connect Python to the database via SQLAlchemy (postgresql+psycopg2). I am running it all on the cloud.
In principle I want to do a few things:
1) Regression of all possible combinations: I am running a simple regression for each pair of stocks, i.e. ABC on XYZ and also XYZ on ABC, across the n = 100 stocks, resulting in n(n-1)/2 pairs (4,950 for n = 100), each regressed in both directions.
-> I am thinking of a function that takes in the pairs of stocks, runs the two regressions, compares the results and picks one based on some criterion.
My question: Is there an efficient way to generate all of these combinations (the "factorial")?
2) Rolling windows: To avoid an overload of data, I thought I would load only the window under investigation, e.g. 30 days, and then roll forward one day at a time, meaning my periods are:
1: 1D-30D
2: 2D-31D and so on
That is, I always drop the first day and add another row at the end of my dataframe. So there are two steps: drop the first day, then read the next row from my database.
My question: Is this a sensible approach, or does Python have something better up its sleeve? How would you do it?
3) Expanding windows: Instead of dropping the first row and adding another one, I keep the 30 days, add another 30 days and then run my regression. The problem here is that at some point I would be holding all the data, which will probably be too big for memory.
My question: What would be a workaround here?
4) As I am running my analysis on the cloud (with a few more cores than my own PC), I could in fact use multithreading, sending "batch" jobs and letting Python do things in parallel. I thought of splitting my dataset into 4 x 25 stocks and running them in parallel (a vertical split), or would it be better to split horizontally?
Additionally I am using Jupyter; I am wondering how to best approach here, usually I have a shell script calling my_program.py. Is this the same here?
Let me try to answer point by point and also note my observations.
From your description, I suppose you have taken each stock scrip as one variable and are trying to perform pairwise linear regression among them. The good news is that this is highly parallelizable. All you need to do is generate the unique combinations of all possible pairings, perform your regressions, and then keep only those models that fit your criteria.
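As a rough sketch of generating the unique pairings, itertools.combinations from the standard library yields each unordered pair exactly once, and both regression directions can then be run per pair. The tickers, prices and function names here are made up, and the hand-written least-squares fit is only a stand-in for whatever regression routine you actually use.

```python
from itertools import combinations

def ols_fit(x, y):
    """Slope and intercept of the least-squares line y = a + b * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def regress_all_pairs(prices):
    """Run both regression directions once for every unordered pair."""
    results = {}
    for a, b in combinations(sorted(prices), 2):
        results[(a, b)] = ols_fit(prices[a], prices[b])  # b regressed on a
        results[(b, a)] = ols_fit(prices[b], prices[a])  # a regressed on b
    return results

# Hypothetical tickers with toy price series.
prices = {
    "ABC": [1.0, 2.0, 3.0, 4.0],
    "XYZ": [2.0, 4.0, 6.0, 8.0],
    "QRS": [4.0, 3.0, 2.0, 1.0],
}
results = regress_all_pairs(prices)

# Number of unordered pairs for 100 stocks.
n_pairs_for_100 = sum(1 for _ in combinations(range(100), 2))
print(len(results), n_pairs_for_100)
```

Since each pair is independent of the others, the list of pairs can be split into chunks and handed to separate worker processes.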
Now, as stocks are your variables, I am assuming the rows are their prices or similar values, but definitely some kind of time series data. If my assumption is correct, then there is a problem with the rolling window approach. In creating these rolling windows, what you are implicitly doing is using a data sampling method called 'bootstrapping', which uses random but repetitive sampling. But because you are just rolling your data, you are not sampling randomly, which might create problems for your regression results. At best the model may simply be overtrained; at worst, I cannot imagine. Hence, drop this approach. Plus, if it is time series data, then the entire concept of windowing would be questionable anyway.
Expanding windows are no good for the same reasons stated above.
About memory and processing capacity: I think this is an excellent scenario for Spark. It is built for exactly this purpose and has excellent support for Python. Millions of data points are no big deal for Spark. Plus, you would be able to massively parallelize your operations. Being on cloud infrastructure also gives you the advantage of easy configuration and expansion. I don't know why people like to use Jupyter even for batch tasks like these, but if you are hell-bent on using it, a PySpark kernel is also supported by Jupyter. A vertical split would probably be the right approach here.
Hope these answer your questions.
I am looking to compute similarities between users and text documents using their topic representations. I.e. each document and user is represented by a vector of topics (e.g. Neuroscience, Technology, etc) and how relevant that topic is to the user/document.
My goal is then to compute the similarity between these vectors, so that I can find similar users, articles and recommended articles.
I have tried to use Pearson Correlation but it ends up taking too much memory and time once it reaches ~40k articles and the vectors' length is around 10k.
I am using numpy.
Can you think of a better way to do this? Or is it inevitable (on a single machine)?
Thank you
I would recommend just using gensim for this instead of rolling your own.
I don't quite understand why you end up using too much memory just to compute the correlation for O(n^2) pairs of items. As the Wikipedia article on the Pearson correlation points out, corr(X, Y) = cov(X, Y) / (std(X) * std(Y)).
That is, to get corr(X, Y) you need only two vectors at a time. If you process your data one pair at a time, memory should not be a problem at all.
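One way this could look in NumPy, as a sketch rather than a tuned implementation: center each vector and scale it to unit length once, after which the Pearson correlation of any two vectors is just their dot product. Processing the rows in blocks keeps only a (block x n) slice of correlations in memory, instead of the full 40k x 40k matrix (about 12.8 GB in float64). The function names, the block size, and the choice to keep only the top k neighbours per row are assumptions for illustration.

```python
import numpy as np

def standardize_rows(vectors):
    """Center each row and scale it to unit length."""
    x = np.asarray(vectors, dtype=np.float64)
    x = x - x.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # constant rows get correlation 0 instead of NaN
    return x / norms

def top_correlated(vectors, k=5, block=1024):
    """For each row, indices of the k most Pearson-correlated other rows."""
    z = standardize_rows(vectors)
    n = z.shape[0]
    top = np.empty((n, k), dtype=np.int64)
    for start in range(0, n, block):
        stop = min(start + block, n)
        corr = z[start:stop] @ z.T  # correlations for this block of rows only
        # Mask each row's correlation with itself.
        corr[np.arange(stop - start), np.arange(start, stop)] = -np.inf
        top[start:stop] = np.argsort(-corr, axis=1)[:, :k]
    return top

# Small random stand-in for the topic vectors.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(20, 50))
top = top_correlated(vectors, k=3, block=7)
```

The running time is still O(n^2) in the number of vectors, but the work is done by matrix multiplication rather than a Python loop over pairs, and peak memory is set by the block size.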
If you are going to load all vectors and do some matrix factorization, that is another story.
As for computation time, I totally understand, because you need to compute this for O(n^2) pairs of items.
Gensim is known to be able to run with modest memory requirements (< 1 GB) on a single CPU/desktop computer within a reasonable time frame. See the experiment they ran on an 8.2 GB dataset using a MacBook Pro (Intel Core i7 2.3 GHz, 16 GB DDR3 RAM). I think that is a larger dataset than yours.
If you have an even larger dataset, you might want to try the distributed version of gensim or even map/reduce.
Another approach is to try locality sensitive hashing.
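To give a flavour of what that could mean here, below is a minimal random-hyperplane (SimHash-style) sketch. Each vector is reduced to a short bit signature, and vectors separated by a small angle tend to agree on most bits, so candidate neighbours can be found by comparing signatures instead of full vectors. This family approximates cosine similarity; centering each vector first makes cosine similarity equal to the Pearson correlation. The function names, the 64-bit signature length and the test vectors are assumptions for illustration.

```python
import numpy as np

def simhash_signatures(vectors, n_bits=64, seed=0):
    """Sign-of-random-projection bit signatures (one boolean row per vector)."""
    x = np.asarray(vectors, dtype=float)
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(x.shape[1], n_bits))  # one random hyperplane per bit
    return (x @ planes) > 0

def hamming(sig_a, sig_b):
    """Number of differing bits; small values suggest a small angle."""
    return int(np.count_nonzero(sig_a != sig_b))

rng = np.random.default_rng(1)
a = rng.normal(size=50)
b = a + 0.01 * rng.normal(size=50)  # near-duplicate of a
c = rng.normal(size=50)             # unrelated vector

sigs = simhash_signatures(np.stack([a, b, c]))
print(hamming(sigs[0], sigs[1]), hamming(sigs[0], sigs[2]))
```

In practice the signatures are split into bands and used as hash-table keys, so that only vectors landing in the same bucket are compared exactly.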
My trick is to use a search engine such as Elasticsearch. It works very well, and it let us unify the API of all our recommender systems. The details are as follows:
Train the topic model on your corpus. Each topic is an array of words, each with a probability, and we take the 6 most probable words as the representation of a topic.
For each document in your corpus, infer a topic distribution, i.e. an array of probabilities, one per topic.
For each document, generate a fake document from the topic distribution and the topic representations; the fake document is about 1024 words long, for example.
For each document, generate a query from the topic distribution and the topic representations; the query is about 128 words long, for example.
That is all the preparation. When you want to get a list of similar articles or other items, you just perform a search:
Get the query for your document, then run that query against your fake documents.
We have found this approach very convenient.
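The steps above do not spell out exactly how the fake documents and queries are assembled, so the following is only one plausible reading: each topic contributes a number of tokens proportional to its probability, drawn by cycling through that topic's representative words. The function name, the topic words and the tiny sizes are made up for illustration.

```python
def build_pseudo_document(topic_probs, topic_words, size=1024):
    """Repeat each topic's representative words in proportion to its weight."""
    tokens = []
    for topic_id, prob in enumerate(topic_probs):
        words = topic_words[topic_id]
        n_tokens = round(prob * size)
        # Cycle through the topic's top words until its token budget is used.
        tokens.extend(words[i % len(words)] for i in range(n_tokens))
    return " ".join(tokens)

# Hypothetical top-6 words for two topics.
topic_words = [
    ["neuron", "brain", "cortex", "synapse", "memory", "cognition"],
    ["software", "cloud", "startup", "chip", "mobile", "network"],
]

# Same topic distribution, rendered once as an indexed document and once as a query.
fake_doc = build_pseudo_document([0.75, 0.25], topic_words, size=16)
query = build_pseudo_document([0.75, 0.25], topic_words, size=8)
print(fake_doc)
print(query)
```

The fake document is what gets indexed in the search engine, and the shorter query is what gets sent to it; the engine's term-frequency scoring then ranks documents with similar topic mixes higher.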