From a recorded time-based pandas dataset, I need to compare various signals to a reference signal. Around the reference signal I define sliding soft limits and constant hard limits. Soft limits are allowed to be crossed, but only for less than an allowed excursion time. For a visual explanation, please see the image below.
Does any existing python library perform this type of data analysis? I did a thorough search, but I couldn't find anything. Thanks.
This sounds specific to what you are doing, and not too tricky to implement, so doing so might not be a bad idea :) It will also give you more freedom in the future should you need to make your analysis more complicated.
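As a starting point, here is a minimal pandas sketch of the soft-limit part. All the names (soft_limit_violations, soft_band, max_excursion) are made up, the band width is assumed constant for simplicity (a sliding band would just be an aligned Series instead of a float), and the time index is assumed to be a DatetimeIndex:

import pandas as pd

def soft_limit_violations(signal: pd.Series, reference: pd.Series,
                          soft_band: float, max_excursion: pd.Timedelta) -> pd.Series:
    """Flag samples where the signal stays outside the soft band around
    the reference for longer than the allowed excursion time."""
    outside = (signal - reference).abs() > soft_band      # soft limit crossed?
    run_id = (outside != outside.shift()).cumsum()        # label consecutive runs
    def run_violates(run: pd.Series) -> bool:
        duration = run.index[-1] - run.index[0]
        return bool(run.iloc[0]) and duration > max_excursion
    # Broadcast the per-run verdict back onto every sample of the run.
    return outside.groupby(run_id).transform(run_violates)

# Usage with made-up data:
# idx = pd.date_range("2021-01-01", periods=600, freq="s")
# mask = soft_limit_violations(sig, ref, soft_band=2.0,
#                              max_excursion=pd.Timedelta(seconds=10))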
I'm building a UGV (unmanned ground vehicle) prototype. The goal is to perform the desired actions on targets placed inside a maze. From what I can find online, navigation inside a labyrinth is usually done with nothing more than a distance sensor. I'd like to gather more ideas than that.
I want to navigate the labyrinth by analyzing the images from a 3D stereo camera. Is there a resource or proven method you can suggest for this? As a secondary problem, the vehicle must start in front of the labyrinth entrance, recognize the entrance and drive in, and then leave the labyrinth after completing its operations inside.
I would be glad if you suggest a source for this problem. :)
The problem description is a bit vague, but I'll try to highlight some general ideas.
A useful assumption is that the labyrinth is a 2D environment which you want to explore. You need to know, at every moment, which part of the map has been explored, which part still needs exploring, and which part is accessible at all (in other words, where the walls are).
An easy initial data structure to help with this is a simple matrix, where each cell represents a square in the real world. Each cell can then be labelled according to its state, starting as unexplored. Then you start moving and exploring. Based on the distances reported by the camera, you can estimate the state of each cell. The exploration itself can be guided by something such as A* or Q-learning.
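As a rough sketch of that bookkeeping (cell size, state labels and method names are all just illustrative assumptions, not something from the question):

import numpy as np

# Cell states for the exploration grid (arbitrary labels).
UNKNOWN, FREE, WALL = 0, 1, 2

class GridMap:
    def __init__(self, width_m, height_m, cell_size_m=0.1):
        shape = (int(height_m / cell_size_m), int(width_m / cell_size_m))
        self.cell_size = cell_size_m
        self.grid = np.full(shape, UNKNOWN, dtype=np.uint8)

    def to_cell(self, x_m, y_m):
        """World coordinates (metres) -> grid indices."""
        return int(y_m / self.cell_size), int(x_m / self.cell_size)

    def mark(self, x_m, y_m, state):
        self.grid[self.to_cell(x_m, y_m)] = state

    def frontier(self):
        """Cells still worth visiting: unknown cells adjacent to free cells."""
        unknown = self.grid == UNKNOWN
        free = self.grid == FREE
        near_free = np.zeros_like(free)
        near_free[1:, :] |= free[:-1, :]
        near_free[:-1, :] |= free[1:, :]
        near_free[:, 1:] |= free[:, :-1]
        near_free[:, :-1] |= free[:, 1:]
        return np.argwhere(unknown & near_free)

# m = GridMap(5.0, 5.0)
# m.mark(0.2, 0.2, FREE); m.mark(0.2, 0.3, WALL)
# next_targets = m.frontier()   # candidate cells for A* / Q-learning to aim at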
Now, a rather subtle issue is that you will have to deal with uncertainty and noise. Sometimes you can ignore it, sometimes you can't. The finer the resolution you need, the bigger the issue becomes. A probabilistic framework is most likely the best solution.
There is an entire field of research around so-called SLAM algorithms (simultaneous localization and mapping). They build a map from some sort of input coming from various types of cameras or sensors and, while building the map, they also solve the localization problem within it. These algorithms are usually designed for 3D environments and are more demanding than the simpler solution indicated above, but you can find ready-to-use implementations. For exploration, something like Q-learning still has to be used on top.
Disclaimer: This is probably a difficult question to answer, but I would greatly appreciate any insights, thoughts or suggestions. If this has already been answered elsewhere and I simply haven't managed to find it, please let me know. Also, I'm somewhat new to algorithm engineering in general, specifically to using Python/NumPy for implementing and evaluating non-trivial algorithm prototypes, and picking it all up as I go. I may be missing something fundamental to scientific computing with NumPy.
To moderator: Feel free to move this thread if there is a better forum for it. I'm not sure this strictly qualifies for the Computational Science forum as it potentially boils down to an implementation/coding issue.
Note: If you want to understand the problem and context, read everything.
I'm using NumPy's std() function to calculate standard deviations in the implementation of an algorithm that finds optimal combinations of nodes in a graph, loosely based on minspan trees. A selection function contains the call to std(). This is an early single-threaded prototype of the algorithm; the algorithm was originally designed for multi-threading (which will likely be implemented in C or C++, so not relevant here).
Edge weights are dependent on properties of nodes previously selected and so are calculated when the algorithm examines an available node. The graph may contain several thousand nodes. At this stage searches are exhaustive, potentially requiring hundreds of thousands of calculations.
Additionally, evaluations of the algorithm are automated and may run hundreds of consecutive trials depending on user input. I haven't seen errors (below) crop up during searches of smaller graphs (e.g., 100 nodes), however scaling the size of the graph close to 1000 guarantees that they make an appearance.
The relevant code can be reduced to roughly the following:
import numpy

# graph.available = [{'id': (int), 'distr': {'dim1': (int), ...}}, ...]
# accumulatedDistr = {'dim1': (int), ...}
# Note: dicts as nodes, etc. used here for readability
edges = []
for node in graph.available:
    intersection = my.intersect(node['distr'], accumulatedDistr)  # Returns a list
    edgeW = numpy.std(intersection)
    edges.append((node['id'], edgeW))
# Perform comparisons, select and combine into accumulatedDistr
Distributions are guaranteed to contain non-negative, non-zero integer values and the lists returned from my.intersect() are likewise guaranteed to never be empty.
When I run a series of trials I'll occasionally see the following messages in the terminal:
C:\Program Files\Python36\lib\site-packages\numpy\core\_methods.py:135: RuntimeWarning: Degrees of freedom <= 0 for slice
keepdims=keepdims)
C:\Program Files\Python36\lib\site-packages\numpy\core\_methods.py:105: RuntimeWarning: invalid value encountered in true_divide
arrmean, rcount, out=arrmean, casting='unsafe', subok=False)
C:\Program Files\Python36\lib\site-packages\numpy\core\_methods.py:127: RuntimeWarning: invalid value encountered in double_scalars
ret = ret.dtype.type(ret / rcount)
They typically only appear a few times during a set of trials, so likely every few million calculations. However, one bad calculation can (at the least!) subtly alter the results of an entire trial, so I'd still like to prevent this if at all possible. I assume there must be some internal state that's accumulating errors, but I've found precious little to suggest how I might address the problem. This concerns me greatly as accumulation of errors casts doubt on all of the following calculations, if that is indeed the case.
I suppose that I may have to look for a proprietary library (e.g., Python wrappers for Intel kernel math libraries) to guarantee the kind of extreme-volume (pardon the abuse of terminology) computational stability that I want. But first: Is there a way to prevent them (NumPy)? Also, just in case: Is the problem as serious as I'm afraid it could be?
I've confirmed that this is indeed a bug in my own code, despite never catching it in the tests. Granted, on the evidence I would have had to run a few million or so consecutive randomized-input tests. On reflection, that might not be a bad idea as general practice for critical sections of code, despite the amount of time it takes. Better to catch it early on than after you've built an entire library around the affected code. Lesson learned.
Thanks to BrenBarn for putting me on the right track! I've run across open source libraries in the past that did have rare, hard to hit bugs. I'm relieved NumPy isn't one of them. At least not this time. Still, I think there's room to complain about vague, unhelpful error messages. Welcome to NumPy, I suppose.
So, to answer my own question: NumPy wasn't the problem. np.std() is very stable. Hopefully my experiences here will help others rule out what isn't causing their code to collapse.
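For anyone who lands here with the same messages: an empty (or otherwise degenerate) array reaching np.std() produces the same kind of RuntimeWarnings, so that is worth ruling out before suspecting NumPy itself. A quick, generic way to reproduce them and to turn them into catchable exceptions (not my actual code, just a sketch):

import warnings
import numpy as np

# np.std() on an empty array has zero degrees of freedom: it emits the same
# kind of RuntimeWarnings seen above and quietly returns nan.
print(np.std(np.array([])))      # nan, plus a RuntimeWarning on stderr

# Promote the warnings to exceptions so the traceback points straight at
# the offending call instead of numpy/core/_methods.py.
with warnings.catch_warnings():
    warnings.simplefilter("error", category=RuntimeWarning)
    try:
        np.std(np.array([]))
    except RuntimeWarning as err:
        print("caught:", err)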
Introduction
I'd like to know what other topic modellers consider an optimal topic-modelling workflow, all the way from pre-processing to maintenance. While this question consists of a number of sub-questions (which I will specify below), I believe this thread would be useful for myself and others who are interested in learning about best practices for the end-to-end process.
Proposed Solution Specifications
I'd like the proposed solution to preferably rely on R for text processing (but Python is fine also) and the topic modelling itself to be done in MALLET (although if you believe other solutions work better, please let us know). I tend to use the topicmodels package in R; however, I would like to switch to MALLET as it offers many benefits over topicmodels: it can handle a lot of data, it does not rely on specific text pre-processing tools, and it appears to be widely used for this purpose. However, some of the issues outlined below are also relevant for topicmodels. I'd like to know how others approach topic modelling and which of the steps below could be improved. Any useful piece of advice is welcome.
Outline
Here is how it's going to work: I'm going to go through the workflow which in my opinion works reasonably well, and I'm going to outline problems at each step.
Proposed Workflow
1. Clean text
This involves removing punctuation marks, digits and stop words, stemming words, and other text-processing tasks. Many of these can be done as part of term-document matrix construction, through functions such as TermDocumentMatrix from R's tm package.
Problem: this may, however, need to be performed on the text strings directly, using functions such as gsub, in order for MALLET to consume the strings. Performing it on the strings directly is not as efficient, as it involves repetition (e.g. the same word would have to be stemmed several times).
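For illustration only (the question asks for R, but this is what the same string-level cleaning looks like as a Python/NLTK sketch; caching the stemmer is one way around the repeated-stemming inefficiency):

import re
from functools import lru_cache
from nltk.corpus import stopwords      # requires nltk.download("stopwords")
from nltk.stem import PorterStemmer

stop = set(stopwords.words("english"))
stemmer = PorterStemmer()

@lru_cache(maxsize=None)               # cache stems so repeated words are stemmed once
def stem(word):
    return stemmer.stem(word)

def clean(text):
    text = re.sub(r"[^a-z\s]", " ", text.lower())   # drop punctuation and digits
    return " ".join(stem(w) for w in text.split() if w not in stop)

# MALLET can then consume one cleaned document per line:
# cleaned_docs = [clean(doc) for doc in raw_docs]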
2. Construct features
In this step we construct a term-document matrix (TDM), followed by filtering of terms based on frequency and TF-IDF values. It is preferable to limit your bag of features to about 1000 or so. Next, go through the terms and identify which need to be (1) dropped (some stop words will make it through), (2) renamed or (3) merged with existing entries. While I'm familiar with the concept of stem completion, I find that it rarely works well.
Problem: (1) Unfortunately MALLET does not work with TDM constructs, so to make use of your TDM you would need to find the difference between the original TDM (with no features removed) and the TDM you are happy with; this difference would become the stop-word list for MALLET. (2) On that note, I'd also like to point out that feature selection requires a substantial amount of manual work, and if anyone has ideas on how to minimise it, please share your thoughts.
Side note: if you decide to stick with R alone, I can recommend the quanteda package, which has a function dfm that accepts a thesaurus as one of its parameters. This thesaurus allows you to capture patterns (usually regexes) as opposed to the words themselves, so for example you could have a pattern \\bsign\\w*.?ups? that would match sign-up, signed up and so on.
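For the frequency/TF-IDF filtering part of this step, here is a minimal Python/scikit-learn sketch (a toy corpus and deliberately loose thresholds, purely for illustration; on real data you would use something stricter, e.g. min_df=5, max_df=0.5, max_features=1000):

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy corpus standing in for the cleaned documents from step 1.
cleaned_docs = [
    "economy growth market bank",
    "bank loan market rate",
    "health vaccine hospital doctor",
    "doctor hospital patient care",
]

vectorizer = CountVectorizer(min_df=2)                  # keep reasonably frequent terms
dtm = vectorizer.fit_transform(cleaned_docs)            # docs x terms (sparse)
tfidf = TfidfTransformer().fit_transform(dtm)           # TF-IDF weights to guide curation

kept = set(vectorizer.get_feature_names_out())
everything = set(CountVectorizer().fit(cleaned_docs).get_feature_names_out())
extra_stopwords_for_mallet = everything - kept          # the "difference" mentioned above
print(sorted(extra_stopwords_for_mallet))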
3. Find optimal parameters
This is a hard one. I tend to break data into test-train sets and run cross-validation fitting a model of k topics and testing the fit using held-out data. Log likelihood is recorded and compared for different resolutions of topics.
Problem: log likelihood does help in understanding how good the fit is, but (1) it often suggests more topics than is practically sensible, and (2) given how long it generally takes to fit a model, it is virtually impossible to search a grid of the other parameters such as iterations, alpha, burn-in and so on.
Side note: when selecting the optimal number of topics, I generally step through a range of topic counts in increments of 5 or so, as incrementing by 1 takes too long to compute.
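To illustrate that grid over the number of topics, here is a minimal Python sketch using gensim rather than MALLET (toy data and made-up parameter values, just to show the held-out evaluation loop):

from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Toy tokenised corpus standing in for real data.
texts = [["bank", "loan", "rate"], ["market", "bank", "growth"],
         ["doctor", "hospital", "patient"], ["vaccine", "doctor", "care"]] * 25
split = int(0.8 * len(texts))
train_texts, test_texts = texts[:split], texts[split:]

dictionary = Dictionary(train_texts)
train = [dictionary.doc2bow(t) for t in train_texts]
test = [dictionary.doc2bow(t) for t in test_texts]

# Step through candidate topic counts in increments of 5 and record the
# held-out per-word likelihood bound.
for k in range(5, 31, 5):
    lda = LdaModel(corpus=train, id2word=dictionary, num_topics=k,
                   passes=5, random_state=0)
    print(k, lda.log_perplexity(test))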
4. Maintenance
It is easy to classify new data into a set of existing topics. However, if you are running the model over time, you would naturally expect some topics to cease being relevant while new topics appear. Furthermore, it might be of interest to study the lifecycle of topics. This is difficult to account for, as you are dealing with a problem that requires an unsupervised solution, yet for it to be tracked over time you need to approach it in a supervised way.
Problem: to overcome the above issue, you would need to (1) fit new data into an old set of topics, (2) construct a new topic model based on the new data, (3) monitor log-likelihood values over time and devise a threshold for when to switch from the old model to the new, and (4) merge the old and new solutions somehow so that the evolution of topics is revealed to a lay observer.
Recap of Problems
String cleaning for MALLET to consume the data is inefficient.
Feature selection requires manual work.
Optimal number of topics selection based on LL does not account for what is practically sensible.
Computational complexity does not give the opportunity to find an optimal grid of parameters (other than the number of topics).
Maintenance of topics over time poses challenging issues as you have to retain history but also reflect what is currently relevant.
If you've read this far, I'd like to thank you; this is a rather long post. If you are interested in the subject, feel free to either add more questions in the comments that you think are relevant or offer your thoughts on how to overcome some of these problems.
Cheers
Thank you for this thorough summary!
As an alternative to topicmodels try the package mallet in R. It runs Mallet in a JVM directly from R and allows you to pull out results as R tables. I expect to release a new version soon, and compatibility with tm constructs is something others have requested.
To clarify, it's a good idea for documents to be at most around 1000 tokens long (not vocabulary). Any more and you start to lose useful information. The assumption of the model is that the position of a token within a given document doesn't tell you anything about that token's topic. That's rarely true for longer documents, so it helps to break them up.
Another point I would add is that documents that are too short can also be a problem. Tweets, for example, don't seem to provide enough contextual information about word co-occurrence, so the model often devolves into a one-topic-per-doc clustering algorithm. Combining multiple related short documents can make a big difference.
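A minimal sketch of that re-shaping step (the 1000-token figure comes from the advice above; the minimum length and the merging rule are just placeholders, and in practice you would only merge related short documents, e.g. same author, day or thread):

def reshape_documents(docs, max_len=1000, min_len=50):
    """Split token lists longer than max_len into chunks and merge
    consecutive short ones (e.g. tweets) until they reach min_len."""
    out, buffer = [], []
    for tokens in docs:
        if len(tokens) > max_len:
            out.extend(tokens[i:i + max_len] for i in range(0, len(tokens), max_len))
        else:
            buffer.extend(tokens)
            if len(buffer) >= min_len:
                out.append(buffer)
                buffer = []
    if buffer:
        out.append(buffer)
    return out

# docs = [["token", ...], ...]   # tokenised documents
# reshaped = reshape_documents(docs)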
Vocabulary curation is in practice the most challenging part of a topic modeling workflow. Replacing selected multi-word terms with single tokens (for example by swapping spaces for underscores) before tokenizing is a very good idea. Stemming is almost never useful, at least for English. Automated methods can help vocabulary curation, but this step has a profound impact on results (much more than the number of topics) and I am reluctant to encourage people to fully trust any system.
Parameters: I do not believe that there is a right number of topics. I recommend using a number of topics that provides the granularity that suits your application. Likelihood can often detect when you have too few topics, but after a threshold it doesn't provide much useful information. Using hyperparameter optimization makes models much less sensitive to this setting as well, which might reduce the number of parameters that you need to search over.
Topic drift: This is not a well understood problem. More examples of real-world corpus change would be useful. Looking for changes in vocabulary (e.g. proportion of out-of-vocabulary words) is a quick proxy for how well a model will fit.
I am trying to solve a problem very similar to the one discussed in this post
I have a broadband signal, which contains a component with time-varying frequency. I need to monitor the phase of this component over time. I am able to track the frequency shifts by (a somewhat brute force method of) peak tracking in the spectrogram. I need to "clean up" the signal around this time varying peak to extract the Hilbert phase (or, alternatively, I need a method of tracking the phase that does not involve the Hilbert transform).
To summarize that previous post: varying the coefficients of an FIR/IIR filter over time causes bad things to happen (it does not just shift the passband, it also completely confuses the filter state in ways that cause surprising transients). However, there is probably some way to adjust filter coefficients over time (likely by jointly modifying the filter coefficients and the filter state in some intelligent way). This is beyond my expertise, but I'd be open to any solutions.
There were two classes of solutions that seem plausible. One is to use a resonator filter (basically a damped harmonic oscillator driven by the signal) with a time-varying frequency. This model is simple enough to avoid surprising filter transients. I will try this, but resonators have very poor attenuation in the stop band (if they can even be said to have a stop band?), which makes me nervous as I'm not 100% sure how the resonator filters will behave.
The other suggestion was to use a filter bank and smoothly interpolate between the various band-pass filtered signals according to the frequency. This approach seems appealing, but I suspect it has some hidden caveats. I imagine that linearly mixing two band-pass filtered signals might not always do what you would expect, and might cause strange artifacts. But this is not my area of expertise, so if mixing over a filter bank is considered a safe solution (one that has been analyzed and published before), I would use it.
Another potential class of solutions occurs to me, which is to just take the phase at the frequency peak of a sliding short-time Fourier transform (which could be windowed, multitaper, etc.). If anyone knows any prior literature on this I'd be very interested. A related option would be to take the phase at the power peak of a sliding complex Morlet wavelet transform over the band of interest.
So, I guess, basically I have three classes of solutions in mind.
1. Resonator filters with time-varying frequency.
2. Using a filter bank, possibly with mixing?
3. Pulling the phase from an STFT or CWT (these can be considered a subset of the filter-bank approach).
My suspicion is that in (2) and (3) surprising things will happen to the phase from time to time, and that in (1) we may not be able to reject as much noise as we'd like. It's not clear to me that this problem even has a perfect solution (time-frequency uncertainty principle?).
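To make option 3 concrete, here is a minimal SciPy sketch of what I have in mind (synthetic chirp and arbitrary window length; the peak picking is the same brute-force tracking mentioned above):

import numpy as np
from scipy.signal import stft

fs = 1000.0
t = np.arange(0, 10, 1 / fs)
# Synthetic broadband signal: a chirp buried in noise.
x = np.cos(2 * np.pi * (50 * t + 2 * t ** 2)) + 0.5 * np.random.randn(t.size)

# Short-time Fourier transform; nperseg trades frequency vs time resolution.
f, frames, Z = stft(x, fs=fs, nperseg=256, noverlap=192)

# Track the spectral peak in each frame and take its phase.
peak_bins = np.abs(Z).argmax(axis=0)
inst_freq = f[peak_bins]
inst_phase = np.angle(Z[peak_bins, np.arange(Z.shape[1])])
# Note: this phase is referenced to each analysis frame, so it generally
# needs unwrapping / re-referencing before comparing across frames.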
Anyway, if anyone has solved this before, and... even better, if anyone knows any papers that sound directly applicable here, I would be grateful.
Not sure if this will help, but googling "monitor phase of time varying component" resulted in this: Link
Hope that helps.
I am writing a scientific program in Python and C with some complex physical simulation algorithms. After implementing an algorithm, I found that there are a lot of possible optimizations to improve performance. Common ones are precalculating values, hoisting calculations out of loops, replacing simple matrix algorithms with more complex ones, and so on. But a problem arises: the unoptimized algorithm is much slower, yet its logic and connection with the theory look much clearer and more readable. It is also harder to extend and modify the optimized algorithm.
So, the question is - what techniques should I use to keep readability while improving performance? Now I am trying to keep both fast and clear branches and develop them in parallel, but maybe there are better methods?
Just as a general remark (I'm not too familiar with Python): I would suggest you make sure that you can easily exchange the slow parts of the 'reference implementation' with the 'optimized' parts (e.g., use something like the Strategy pattern).
This will allow you to cross-validate the results of the more sophisticated algorithms (to ensure you did not mess up the results), and will keep the overall structure of the simulation algorithm clear (separation of concerns). You can place the optimized algorithms into separate source files / folders / packages and document them separately, in as much detail as necessary.
Apart from this, try to avoid the usual traps: don't do premature optimization (check if it is actually worth it, e.g. with a profiler), and don't re-invent the wheel (look for available libraries).
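A minimal Python sketch of that interchangeable-implementation idea (toy physics and illustrative names, not the poster's actual code):

import numpy as np

def step_reference(state, dt):
    """Readable version: follows the theory line by line."""
    new_state = state.copy()
    for i in range(len(state)):
        new_state[i] = state[i] + dt * np.sin(state[i])   # toy physics
    return new_state

def step_optimized(state, dt):
    """Vectorised version: must produce the same numbers."""
    return state + dt * np.sin(state)

def run(step, state, dt, n_steps):
    # The simulation driver only sees the "strategy", not its internals.
    for _ in range(n_steps):
        state = step(state, dt)
    return state

# Cross-validate the fast branch against the reference branch.
state0 = np.linspace(0.0, 1.0, 1000)
assert np.allclose(run(step_reference, state0, 1e-3, 10),
                   run(step_optimized, state0, 1e-3, 10))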
Yours is a very good question that arises in almost every piece of code, however simple or complex, that's written by any programmer who wants to call himself a pro.
I try to remember and keep in mind that a reader newly come to my code has pretty much the same crude view of the problem and the same straightforward (maybe brute force) approach that I originally had. Then, as I get a deeper understanding of the problem and paths to the solution become clearer, I try to write comments that reflect that better understanding. I sometimes succeed and those comments help readers and, especially, they help me when I come back to the code six weeks later. My style is to write plenty of comments anyway and, when I don't (because: a sudden insight gets me excited; I want to see it run; my brain is fried), I almost always greatly regret it later.
It would be great if I could maintain two parallel code streams: the naïve way and the more sophisticated optimized way. But I have never succeeded in that.
To me, the bottom line is that if I can write clear, complete, succinct, accurate and up-to-date comments, that's about the best I can do.
Just one more thing that you know already: optimization usually doesn't mean shoehorning a ton of code onto one source line, perhaps by calling a function whose argument is another function whose argument is another function whose argument is yet another function. I know that some do this to avoid storing a function's value temporarily. But it does very little (usually nothing) to speed up the code and it's a bitch to follow. No news to you, I know.
It is common to assume you must give up readability to get performance.
That's not necessarily so.
You need to find out what exactly it is spending so much time doing, and why.
Notice, I didn't say you need to do any measuring.
Here's an example of what I mean.
Chances are very good that you can do some simple changes to avoid waste motion, but don't fix anything until the program itself has told you what to fix.
def whatYouShouldDo(servings, integration_method=oven):
    """
    Make chicken soup
    """
    # Comments:
    # They are important. With some syntax highlighting, the comments are
    # the first thing a new programmer will look for. Therefore, they should
    # motivate your algorithm, outline it, and break it up into stages.
    # You can MAKE IT FEEL AS IF YOU ARE READING TEXT, interspersing code
    # amongst the text.
    #
    # Algorithm overview:
    # To make chicken soup, we will acquire chicken broth and some noodles.
    # Preprocessing ingredients is done to optimize cooking time. Then we
    # will output in SOUP format via stdout.
    #
    # BEGIN ALGORITHM
    #
    # Preprocessing:
    # 1. Thaw chicken broth.
    broth = chickenstore.deserialize()
    # 2. Mix with noodles
    if noodles not in cache:
        with begin_transaction(locals=poultry) as t:
            cache[noodles] = t.buy(noodles)  # get from local store
    noodles = cache[noodles]
    # 3. Perform 4th-order Runge-Kutta numerical integration
    from kitchensink import *  # FIXME: poor form, better to 'from kitchensink import pan' at the top
    result = boilerplate.RK4(broth `semidirect_product` noodles)
    # 4. Serve hot
    log.debug('attempting to serve')
    log.debug('serve successful')
    return result
also see http://en.wikipedia.org/wiki/Literate_programming#Example
I've also heard that this is what http://en.wikipedia.org/wiki/Aspect-oriented_programming attempts to help with, though I haven't really looked into it. (It just seems to be a fancy way of saying "put your optimizations and your debugs and your other junk outside of the function you're writing".)