Deep learning, signal processing and feature engineering


I have signals represented in Python as dense matrices (the values are y-coordinates from a chart, e.g. weather temperature in different locations around the world).
I'm currently trying to process/recognize these matrices with different kinds of deep neural networks.
I'm also trying to enhance the important parts of the signals in various ways. I know that for sound and the human voice, log-mel spectrograms, MFCCs, etc. are useful tools for enhancing the important parts and de-emphasizing the unimportant ones.
What are my best options for enhancing the most important features of my dense matrices in a similar way (in other words, for feature engineering them)? I'm aware that the model can't know which features are the most important without some kind of help or feature engineering, but what more can I do as a human to improve the model? For example, can I set the importance of each feature?

Related

Applying CNN to Fast time Fourier Transform?

I have data to which a fast Fourier transform has been applied
(amplitudes at specific frequencies in Hz).
There are solutions on the internet that apply a CNN to a mel spectrogram; however, I have found none that apply a CNN to an FFT-transformed signal.
Is it possible to apply a CNN to FFT-transformed signals?
Or is it not possible because a CNN relies on the temporal ordering of the data?
Thanks!

I'm assuming each row of your spreadsheet is IID, i.e. it wouldn't change the problem to re-order the rows in that spreadsheet.
In this case you have a pretty typical ML problem. The fact that the FFT has already been applied and specific frequency responses (columns) have been extracted is a process called "feature engineering". Prior to the common use of neural networks, this was a standard step in all machine learning problems and remains common to a great many domains.
With data that has been feature engineered, you should look to traditional ML algorithms. Random Forests, XGBoost, and Linear Regression come to mind. A fully connected neural network is also appropriate, but I would typically expect it to under-perform other ML methods.
The hallmark of a CNN is that it operates on an ordered sequence of data. In your case the raw data, from which your dataset was derived, would be appropriate for a CNN. In a sound file you have a 1D sequence of information. You could not re-order the data in the time dimension without fundamentally changing its meaning.
A 2D CNN operates over an image where the pixel order in X and Y cannot be changed. Again the sequential order of the data matters. The same applies for 3D CNNs.
Be aware that applying an FFT has fundamentally biased your solution by representing the data only through a limited set of frequency responses. All feature engineering biases the problem, presumably in a well-thought-out way. However, it's entirely possible that other useful signals exist in the data that aren't expressed by the FFT at 10, 20, 30 Hz, etc. A CNN has the capacity to learn its own version of an FFT as well as other non-cyclic patterns. Typically, the lack of a feature engineering step is the key differentiator between CNNs and traditional ML algorithms.
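As a rough illustration of the feature engineering step discussed in this answer, the sketch below computes FFT amplitudes of a raw signal and keeps only a few frequency bins. The sampling rate, the synthetic signal and the helper name fft_amplitudes are made up for the example; only the numpy.fft calls are real library functions.

```python
import numpy as np

def fft_amplitudes(signal, sample_rate, freqs_hz):
    """Return the FFT amplitude of `signal` at the bins closest to `freqs_hz`."""
    spectrum = np.abs(np.fft.rfft(signal)) / len(signal)
    bin_freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    idx = [int(np.argmin(np.abs(bin_freqs - f))) for f in freqs_hz]
    return spectrum[idx]

# Hypothetical raw signal: one second sampled at 200 Hz,
# containing a 10 Hz and a 25 Hz component.
sample_rate = 200
t = np.arange(200) / sample_rate
raw = np.sin(2 * np.pi * 10 * t) + 0.5 * np.sin(2 * np.pi * 25 * t)

# Keeping only 10, 20 and 30 Hz captures the first component
# but discards the 25 Hz one entirely.
features = fft_amplitudes(raw, sample_rate, [10, 20, 30])
missed = fft_amplitudes(raw, sample_rate, [25])
```

Here the 25 Hz component never reaches the downstream model, which is the kind of bias a fixed set of frequency columns can introduce.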

How do I make a good feature using machine learning on a timeseries forecast that has only traffic volume as an input?

I have a time series that contains only traffic volume. I've tried FB Prophet and NeuralProphet. They work okay, but I would like to try a machine learning approach. So far my problem is constructing features. The classical dayofyear, month, etc. do not give me good results. I have tried using shift to get the average, minimum, and maximum of the two previous days. That works, but when I try to predict several days in advance the feature is no longer usable, because I can't compute those statistics for a day that hasn't happened yet. My main concern is finding a good feature that is also available in the dataframe for the future dates I want to predict. A picture of my data is included. Does anyone know how I would do this?
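One common way to keep the shift-based features described above usable several days ahead is to compute them only from values observed at least as many steps back as the forecast horizon. The following is a minimal sketch of that idea, not a method taken from this thread; the function name, the horizon and the sample volumes are hypothetical.

```python
from statistics import mean

def lagged_window_features(series, horizon, window):
    """Build (features, target) rows where the features only use values
    observed at least `horizon` steps before the target."""
    rows = []
    for t in range(horizon + window - 1, len(series)):
        end = t - horizon + 1            # exclusive end of the usable history
        past = series[end - window:end]
        rows.append(((mean(past), min(past), max(past)), series[t]))
    return rows

# Hypothetical daily traffic volumes.
volume = [10, 12, 11, 15, 14, 18, 17, 21, 20, 24]

# Predict 3 days ahead from the mean/min/max of a 2-day window.
rows = lagged_window_features(volume, horizon=3, window=2)
```

Because every feature row depends only on data that is already known three days before the target, the same features can be built for the future dates at prediction time.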

First of all, some definitions need clarifying. FBProphet works the same way as any machine learning algorithm, that is, by fitting a model and then predicting the output. Being an additive regression model with a piecewise linear or logistic growth curve trend, it can itself be considered a machine learning method for predicting a continuous outcome variable.
Secondly, I think you missed the most important term that your question is about, namely feature engineering.
Feature engineering includes:
The process of using domain knowledge to extract features (characteristics, properties, attributes) from raw data.
The process of transforming raw data into features that better represent the underlying problem to the predictive models, etc.
But it is unusual to use machine learning to do feature engineering. You do feature engineering in order to improve your machine learning model. Many techniques, such as imputation, handling outliers, binning, log transform, one-hot encoding, grouping operations, feature split, and scaling, are hybrid methods that use a statistical approach and/or domain knowledge.
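To make a few of the techniques listed above concrete, here is a small sketch of a log transform, min-max scaling, one-hot encoding and binning in plain Python. The helper names and sample values are invented for illustration; in practice a library such as pandas or scikit-learn would typically be used instead.

```python
import math

def log_transform(values):
    """Compress large values with log(1 + x)."""
    return [math.log1p(v) for v in values]

def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(categories, levels):
    """Encode each category as a 0/1 vector over the known levels."""
    return [[1 if c == level else 0 for level in levels] for c in categories]

def bin_values(values, edges):
    """Return, for each value, the number of bin edges it has reached."""
    return [sum(v >= e for e in edges) for v in values]

logged = log_transform([0, 9, 99])
scaled = min_max_scale([2, 4, 6])
encoded = one_hot(["Mon", "Tue"], ["Mon", "Tue", "Wed"])
binned = bin_values([5, 50, 500], edges=[10, 100])
```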
Regarding your data, bearing in mind that seasonality is already handled by FBProphet, I am not confident that feature engineering transformations such as adding the day of the week, holiday periods, etc. would really improve performance.
To conclude, it is not possible to create ex-nihilo new features that would outperform your model. Whether you process/transform your data or add external domain-knowledge dataset

Training a model from multiple corpus

Imagine I have a fastText model that has been trained on Wikipedia articles (as explained on the official website).
Would it be possible to train it again on another corpus (scientific documents) that could add new or more pertinent links between words, especially scientific ones?
To summarize, I need the classic links that exist between English words as learned from Wikipedia, but I would like to enhance this model with new documents about specific sectors. Is there a way to do that? And if so, is there a way to weight the training so that relations coming from my custom documents are 'more important'?
My final goal is to compute cosine similarity between documents that can be very scientific (which is why I thought adding more scientific documents would give better results).

Adjusting more-generic models with your specific domain training data is often called "fine-tuning".
The gensim implementation of FastText allows an existing model to expand its known-vocabulary via what's seen in new training data (via build_vocab(..., update=True)) and then for further training cycles including that new vocabulary to occur (through train()).
But, doing this particular form of updating introduces murky issues of balance between older and newer training data, with no clear best practices.
(As just one example, to the extent there are tokens/ngrams in the original model that don't recur in the new data, new training pulls those that are in the new data into new positions that are optimal for the new data... but potentially arbitrarily far from comparable compatibility with the older tokens/ngrams.)
Further, it's likely some model modes (like negative-sampling versus hierarchical-softmax), and some mixes of data, have a better chance of net-benefiting from this approach than others – but you pretty much have to hammer out the tradeoffs yourself, without general rules to rely upon.
(There may be better fine-tuning strategies for other kinds of models; this is just speaking to the ability of the gensim FastText to update-vocabulary and repeat-train.)
But perhaps, your domain of interest is scientific texts. And maybe you also have a lot of representative texts – perhaps even, at training time, the complete universe of papers you'll want to compare.
In that case, are you sure you want to deal with the complexity of starting with a more-generic word-model? Why would you want to contaminate your analysis with any of the dominant word-senses in generic reference material, like Wikipedia, if in fact you already have sufficiently-varied and representative examples of your domain words in your domain contexts?
So I would recommend first trying to train your own model, from your own representative data. And only if you then fear you're missing important words/senses, try mixing in Wikipedia-derived senses. (At that point, another way to mix in that influence would be to mix Wikipedia texts with your other corpus. And you should also be ready to test whether that really helps or hurts, because it could be either.)
Also, to the extent your real goal is comparing full papers, you might want to look into other document-modeling strategies, including bag-of-words representations, the Doc2Vec ('Paragraph Vector') implementation in gensim, or others. Those approaches will not necessarily require per-word vectors as an input, but might still work well for quantifying text-to-text similarities.
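As a baseline for the bag-of-words option mentioned above, the sketch below compares documents by the cosine similarity of their raw word counts, with no word vectors involved. The whitespace tokenization and the sample sentences are simplifying assumptions; a real pipeline would likely add proper tokenization and TF-IDF weighting.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Count lowercase whitespace-separated tokens."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical documents.
doc1 = bag_of_words("protein folding energy landscape")
doc2 = bag_of_words("energy landscape of protein folding")
doc3 = bag_of_words("stock market volatility")

related = cosine_similarity(doc1, doc2)
unrelated = cosine_similarity(doc1, doc3)
```

Documents sharing vocabulary score close to 1, while documents with no words in common score 0, which gives a simple reference point before trying Doc2Vec or averaged word vectors.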

Machine Learning, What are the common techniques for feature engineering and presenting the model?

I have an ML language identification project (Python) that requires a multi-class classification model with high-dimensional feature input.
Currently, all I can do to improve accuracy is trial and error: mindlessly combining available feature extraction algorithms and available ML models to see if I get lucky.
I am asking whether there is a commonly accepted workflow for finding an ML solution systematically.
This thought might be naive, but I am wondering whether I can somehow visualize the high-dimensional data and the decision boundaries of my model. Hopefully this visualization can help me do some tuning. In MATLAB, after training, I can choose any two features among all features and MATLAB will plot a decision boundary accordingly. Can I do this in Python?
Also, I am looking for some types of graphs that I can use in the presentation to introduce my model and features. What are the most common graphs used in the field?
Thank you

Feature engineering is more of an art than a technique. It might require domain knowledge, or you could try adding, subtracting, dividing and multiplying different columns to make new features and check whether they add value to the model. If you are using linear regression, the adjusted R-squared value should increase when a feature adds value; in tree models, you can inspect the feature importance, etc.
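The check described above can be sketched as follows: fit a linear regression with and without a candidate feature and compare the adjusted R-squared. The data here is synthetic and deliberately constructed so that the ratio of two columns drives the target; the helper adjusted_r2 is written for this example and is not a library function.

```python
import numpy as np

def adjusted_r2(X, y):
    """Fit ordinary least squares with an intercept and return adjusted R^2."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    ss_res = float(residuals @ residuals)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Synthetic data: the target depends on the ratio a / b plus a little noise.
rng = np.random.default_rng(0)
a = rng.uniform(1.0, 10.0, size=200)
b = rng.uniform(1.0, 10.0, size=200)
y = a / b + rng.normal(0.0, 0.05, size=200)

base = np.column_stack([a, b])                # original columns only
with_ratio = np.column_stack([a, b, a / b])   # plus the engineered feature

score_base = adjusted_r2(base, y)
score_ratio = adjusted_r2(with_ratio, y)
```

Adjusted R-squared penalizes extra columns, so a rise after adding the engineered feature suggests it carries information the original columns did not.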

Time series forecasting (eventually with python) [closed]

What algorithms exist for time series forecasting/regression?
What about using neural networks? (Best docs on this topic?)
Are there Python libraries/code snippets that can help?

The classical approaches to time series regression are:
auto-regressive models (there are whole literatures about them)
Gaussian Processes
Fourier decomposition or similar to extract the periodic components of the signal (i.e., hidden oscillations in the data)
Other less common approaches that I know about are
Slow Feature Analysis, an algorithm that extracts the driving forces of a time series, e.g., the parameters behind a chaotic signal
Neural Network (NN) approaches, either using recurrent NNs (i.e., built to process time signals) or classical feed-forward NNs that receive as input part of the past data and try to predict a point in the future; the advantage of the latter is that recurrent NNs are known to have a problem with taking into account the distant past
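For the feed-forward variant mentioned in the last item, "part of the past data" is usually supplied as a fixed-length sliding window. A minimal sketch of building such input/target pairs is shown below; the function name and the window length are illustrative choices, and the network itself is omitted.

```python
def make_windows(series, n_lags):
    """Turn a 1D series into (inputs, target) pairs for a feed-forward model."""
    inputs, targets = [], []
    for i in range(n_lags, len(series)):
        inputs.append(series[i - n_lags:i])   # the n_lags most recent values
        targets.append(series[i])             # the value to predict
    return inputs, targets

series = [1, 2, 3, 4, 5, 6]
X, y = make_windows(series, n_lags=3)
```

Each row of X can then be fed to any standard regressor, with y as the training target.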
In my opinion for financial data analysis it is important to obtain not only a best-guess extrapolation of the time series, but also a reliable confidence interval, as the resulting investment strategy could be very different depending on that. Probabilistic methods, like Gaussian Processes, give you that "for free", as they return a probability distribution over possible future values. With classical statistical methods you'll have to rely on bootstrapping techniques.
There are many Python libraries that offer statistical and Machine Learning tools, here are the ones I'm most familiar with:
NumPy and SciPy are a must for scientific programming in Python
There is a Python interface to R, called RPy
statsmodels contains classical statistical modeling techniques, including autoregressive models; it works well with Pandas, a popular data analysis package
scikit-learn (formerly scikits.learn), MDP, MLPy, Orange are collections of machine learning algorithms
PyMC, a Python module that implements Bayesian statistical models and fitting algorithms, including Markov chain Monte Carlo.
PyBrain contains (among other things) implementations of feed-forward and recurrent neural networks
at the Gaussian Process site there is a list of GP software, including two Python implementations
mloss is a directory of open source machine learning software

I've no idea about python libraries, but there are good forecasting algorithms in R which are open source. See the forecast package for code and references for time series forecasting.

Two approaches
There are two ways to deal with temporally structured input for classification, regression, clustering, forecasting and related tasks:
Dedicated Time Series Model: The machine learning algorithm incorporates such time series directly. Such a model is like a black box and it can be hard to explain the behavior of the model. Examples are autoregressive models.
Feature based approach: Here the time series are mapped to another, possibly lower dimensional, representation. This means that the feature extraction algorithm calculates characteristics such as the average or maximal value of the time series. The features are then passed as a feature matrix to a "normal" machine learning algorithm such as a neural network, random forest or support vector machine. This approach has the advantage of better explainability of the results. Further, it enables us to use the well developed theory of supervised machine learning.
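The feature based approach can be sketched by hand in a few lines: each time series is reduced to a handful of summary statistics, and the results form one row per series. This is a hand-rolled illustration only and does not use the tsfresh API; the series names and values are hypothetical.

```python
from statistics import mean, pstdev

def extract_features(series):
    """Map one time series to a small set of summary characteristics."""
    return {
        "mean": mean(series),
        "max": max(series),
        "min": min(series),
        "std": pstdev(series),
    }

# Hypothetical collection of time series, one per sensor.
time_series = {
    "sensor_a": [1.0, 2.0, 3.0, 4.0],
    "sensor_b": [5.0, 5.0, 5.0, 5.0],
}

# One feature row per series, ready to hand to a standard ML algorithm.
feature_matrix = {name: extract_features(values)
                  for name, values in time_series.items()}
```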
tsfresh calculates a huge number of features
The Python package tsfresh calculates a huge number of such features from a pandas.DataFrame containing the time series. You can find its documentation at http://tsfresh.readthedocs.io.
Disclaimer: I am one of the authors of tsfresh.

Speaking only about the algorithms behind them, I recently used the double exponential smoothing in a project and it did well by forecasting new values when there is a trend in the data.
The implementation is pretty trivial, but maybe the algorithm is not sufficiently elaborated for your case.
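For reference, a minimal sketch of double exponential smoothing is given below, assuming the variant meant is Holt's linear trend method. The initialization (first value as level, first difference as trend) is one common choice among several, and the smoothing parameters and history are arbitrary example values.

```python
def double_exponential_smoothing(series, alpha, beta, n_forecast):
    """Holt's linear trend method; returns `n_forecast` values past the end."""
    level = series[0]
    trend = series[1] - series[0]
    for value in series[1:]:
        previous_level = level
        # Blend the new observation with the previous one-step-ahead estimate.
        level = alpha * value + (1 - alpha) * (level + trend)
        # Blend the latest change in level with the previous trend.
        trend = beta * (level - previous_level) + (1 - beta) * trend
    return [level + (k + 1) * trend for k in range(n_forecast)]

history = [10.0, 12.0, 14.0, 16.0, 18.0]
forecast = double_exponential_smoothing(history, alpha=0.5, beta=0.5, n_forecast=3)
```

On a perfectly linear history like this one, the forecast simply continues the line; on noisy data, alpha and beta control how quickly the level and trend react.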

Have you tried autocorrelation for finding periodic patterns in the time series?
You can do that with the numpy.correlate function.
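A small sketch of that idea follows: the series is correlated with itself via numpy.correlate, and the first peak after lag 0 is taken as the period. The synthetic sine wave and the helper names are assumptions made for the example; real data is noisier and may need smoothing or a minimum peak height.

```python
import numpy as np

def autocorrelation(x):
    """Normalized autocorrelation for lags 0, 1, 2, ..."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    full = np.correlate(x, x, mode="full")
    acf = full[len(x) - 1:]          # keep non-negative lags only
    return acf / acf[0]

def first_peak_lag(acf):
    """Return the first lag > 0 that is a local maximum, or None."""
    for k in range(1, len(acf) - 1):
        if acf[k] > acf[k - 1] and acf[k] >= acf[k + 1]:
            return k
    return None

# Synthetic series with a period of 12 samples.
t = np.arange(120)
signal = np.sin(2 * np.pi * t / 12)

acf = autocorrelation(signal)
period = first_peak_lag(acf)
```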

Group method of data handling is widely used to forecast financial data.

If you want to understand time series forecasting using Python, the link below is very helpful.
https://github.com/ManojKumarMaruthi/Time-Series-Forecasting
