Has there been work in featuretools (or an additional Python package) that would integrate it with a common ML library, such as sklearn? E.g., it would be nice to test a feature for its predictive power, and if it's high enough, generate more features like it (e.g., use the same initial variable). In other words, can the process of generating new features be guided by their predictive power?
I am working on an ML language identification project (Python) that requires a multi-class classification model with high-dimensional feature input.
Currently, all I can do to improve accuracy is trial and error: mindlessly combining available feature extraction algorithms and available ML models and seeing if I get lucky.
I am asking whether there is a commonly accepted workflow for finding an ML solution systematically.
This thought might be naive, but I am wondering whether I can somehow visualize this high-dimensional data and the decision boundaries of my model. Hopefully such a visualization can help me do some tuning. In MATLAB, after training, I can choose any two features among all features and MATLAB will plot a decision boundary accordingly. Can I do this in Python?
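Yes; a minimal sketch with scikit-learn and matplotlib (synthetic stand-in data, hypothetical feature indices i and j) is to refit a classifier on just the two chosen features and evaluate it on a grid:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Stand-in for your own feature matrix and labels.
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

# Pick any two feature indices to inspect, as in the MATLAB workflow.
i, j = 0, 1
X2 = X[:, [i, j]]
clf = LogisticRegression().fit(X2, y)

# Evaluate the classifier on a grid covering the two features.
xx, yy = np.meshgrid(
    np.linspace(X2[:, 0].min() - 1, X2[:, 0].max() + 1, 200),
    np.linspace(X2[:, 1].min() - 1, X2[:, 1].max() + 1, 200))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X2[:, 0], X2[:, 1], c=y, s=10)
plt.xlabel("feature %d" % i)
plt.ylabel("feature %d" % j)
plt.show()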
Also, I am looking for some types of graphs that I can use in the presentation to introduce my model and features. What are the most common graphs used in the field?
Thank you
Feature engineering is more of an art than a technique. It may require domain knowledge, or you can try adding, subtracting, dividing and multiplying different columns to create features and check whether they add value to the model. If you are using linear regression, the adjusted R-squared value should increase; in tree models, you can look at the feature importances, etc.
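For example (hypothetical columns a and b, made-up data), one round of that loop with a tree model could look like:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical data frame with two numeric columns and a target.
df = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6, 7, 8],
                   "b": [2, 1, 4, 3, 6, 5, 8, 7],
                   "y": [3, 3, 7, 7, 11, 11, 15, 15]})

# Candidate engineered features: sums, products, ratios of existing columns.
df["a_plus_b"] = df["a"] + df["b"]
df["a_times_b"] = df["a"] * df["b"]
df["a_over_b"] = df["a"] / df["b"]

X = df.drop("y", axis=1)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, df["y"])

# Check which raw and engineered columns the model actually relies on.
for name, importance in sorted(zip(X.columns, model.feature_importances_),
                               key=lambda t: -t[1]):
    print(name, round(importance, 3))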
I am using Python 3.5 with TensorFlow 0.11.
I have a dataset with a large number of features (>5000) and a relatively small number of samples (<200). I am using the skflow wrapper function DNNClassifier for deep learning.
It seems to work well for the classification task, but I want to identify the important features among this large number of features.
Internally, DNNClassifier seems to perform feature selection (or feature extraction). Is there any way to perform feature selection with TensorFlow?
Or, is there some function to extract the weights of the features?
(There was a function DNNClassifier.weights_, but it seems to be deprecated)
If TensorFlow does not support feature selection or expose weight information, would it be reasonable to conduct feature selection with another method (such as univariate feature selection) first and then try deep learning?
Thank you for your help.
You can evaluate the weights.
For example, if your variable is defined by
weights = tf.Variable(np.ones([100, 10], dtype='float32'), name='weights')
you can get its value inside a TensorFlow session (after initializing the variables):
value = weights.eval()  # equivalent to sess.run(weights) with sess as the default session
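For the univariate-selection route mentioned in the question, a minimal scikit-learn sketch (random stand-in data of roughly the stated shape; k=100 is an arbitrary choice) would be:
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Stand-in data: 200 samples, 5000 features, binary labels.
rng = np.random.RandomState(0)
X = rng.randn(200, 5000)
y = rng.randint(0, 2, size=200)

# Keep the k features with the strongest univariate association to y,
# then feed the reduced matrix to DNNClassifier (or any other model).
selector = SelectKBest(score_func=f_classif, k=100)
X_reduced = selector.fit_transform(X, y)
kept_columns = selector.get_support(indices=True)
print(X_reduced.shape, kept_columns[:10])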
I need advice choosing a model and machine learning algorithm for a classification problem.
I'm trying to predict a binary outcome for a subject. I have 500,000 records in my data set and 20 continuous and categorical features. Each subject has 10–20 records. The data is labeled with its outcome.
So far I'm thinking logistic regression model and kernel approximation, based on the cheat-sheet here.
I am unsure where to start when implementing this in either R or Python.
Thanks!
Choosing an algorithm and optimizing its parameters is a difficult task in any data mining project, because it must be customized for your data and problem. Try different algorithms such as SVM, Random Forest, Logistic Regression, KNN, etc., evaluate each with cross-validation, and then compare them.
You can use GridSearch in scikit-learn to try different parameter settings and optimize the parameters for each algorithm. You can also try this project, which tests a range of parameters with a genetic algorithm.
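For example, a cross-validated grid search over SVM parameters might look roughly like this (the dataset and parameter ranges are just placeholders):
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Placeholder parameter grid; adapt the ranges to your own data.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}

search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)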
Features
If your categorical features don't have too many possible different values, you might want to have a look at sklearn.preprocessing.OneHotEncoder.
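A minimal sketch of that encoding step (hypothetical integer-coded categories; newer scikit-learn versions can also encode strings directly):
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical feature already coded as integers
# (e.g. 0 = "red", 1 = "green", 2 = "blue").
categories = np.array([[0], [1], [2], [1]])

encoder = OneHotEncoder()
one_hot = encoder.fit_transform(categories).toarray()
print(one_hot)  # one 0/1 indicator column per distinct value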
Model choice
The choice of "the best" model depends mainly on the amount of available training data and the simplicity of the decision boundary you expect to get.
You can try dimensionality reduction to 2 or 3 dimensions. Then you can visualize your data and see if there is a nice decision boundary.
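For example, a quick PCA projection to two dimensions (synthetic stand-in data with 20 features, as in the question):
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

# Stand-in for the 20-feature data set described in the question.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           random_state=0)

# Project onto the two directions of largest variance and color by class.
X_2d = PCA(n_components=2).fit_transform(X)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, s=5)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()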
With 500,000 training examples you can think about using a neural network. I can recommend Keras for beginners and TensorFlow for people who know how neural networks work.
You should also know that there are Ensemble methods.
A nice cheat sheet on what to use is in the sklearn tutorial you already found:
(source: scikit-learn.org)
Just try it, compare different results. Without more information it is not possible to give you better advice.
I am currently in the process of designing a recommender system for text articles (a binary case of 'interesting' or 'not interesting'). One of my specifications is that it should continuously update to changing trends.
From what I can tell, the best way to do this is to make use of a machine learning algorithm that supports incremental/online learning.
Algorithms like the Perceptron and Winnow support online learning, but I am not completely certain about Support Vector Machines. Does the scikit-learn Python library support online learning and, if so, is a support vector machine one of the algorithms that can make use of it?
I am obviously not completely tied to using support vector machines, but they are usually the go-to algorithm for binary classification due to their all-round performance. I would be willing to change to whatever fits best in the end.
While online algorithms for SVMs do exist, it is important to specify whether you want kernel or linear SVMs, as many efficient algorithms have been developed for the special case of linear SVMs.
For the linear case, if you use the SGD classifier in scikit-learn with the hinge loss and L2 regularization, you will get an SVM that can be updated online/incrementally. You can combine this with feature transforms that approximate a kernel to get something similar to an online kernel SVM.
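A rough sketch of both ideas (RBFSampler is just one of scikit-learn's kernel approximations; the streaming batches here are synthetic):
import numpy as np
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)

# Linear SVM trained online: hinge loss + L2 penalty.
clf = SGDClassifier(loss="hinge", penalty="l2")

# Approximate an RBF kernel with an explicit feature map,
# fitted once up front (here on random reference data).
rbf = RBFSampler(gamma=1.0, random_state=0).fit(rng.randn(100, 20))

classes = np.array([0, 1])
for _ in range(10):                       # stream of incoming batches
    X_batch = rng.randn(32, 20)           # stand-in article features
    y_batch = rng.randint(0, 2, size=32)  # stand-in "interesting" labels
    clf.partial_fit(rbf.transform(X_batch), y_batch, classes=classes)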
One of my specifications is that it should continuously update to changing trends.
This is referred to as concept drift, and will not be handled well by a simple online SVM. Using the PassiveAggressive classifier will likely give you better results, as its learning rate does not decrease over time.
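In scikit-learn that is a small change (a minimal sketch):
from sklearn.linear_model import PassiveAggressiveClassifier

clf = PassiveAggressiveClassifier()
# Trained incrementally exactly like SGDClassifier, e.g.:
# clf.partial_fit(X_batch, y_batch, classes=[0, 1])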
Assuming you get feedback while training / running, you can attempt to detect decreases in accuracy over time and begin training a new model when the accuracy starts to decrease (and switch to the new one when you believe that it has become more accurate). JSAT has 2 drift detection methods (see jsat.driftdetectors) that can be used to track accuracy and alert you when it has changed.
It also has more online linear and kernel methods.
(bias note: I'm the author of JSAT).
Maybe it's me being naive, but I think it is worth mentioning how to actually update the scikit-learn SGD classifier when you present your data incrementally:
from sklearn import linear_model
clf = linear_model.SGDClassifier()
x1 = some_new_data   # first batch of feature vectors
y1 = the_labels      # corresponding labels
clf.partial_fit(x1, y1, classes=all_possible_labels)  # classes is required on the first partial_fit call
x2 = some_newer_data
y2 = the_labels
clf.partial_fit(x2, y2)
Technical aspects
The short answer is no. The sklearn implementation (as well as most existing ones) does not support online SVM training. It is possible to train an SVM incrementally, but it is not a trivial task.
If you want to limit yourself to the linear case, then the answer is yes, as sklearn provides you with Stochastic Gradient Descent (SGD), which has an option to minimize the SVM criterion.
You can also try the pegasos library instead, which supports online SVM training.
Theoretical aspects
The problem of trend adaptation is currently very popular in the ML community. As @Raff stated, it is called concept drift, and there are numerous approaches, often in the form of meta models which analyze "how the trend is behaving" and change the underlying ML model (for example by forcing it to retrain on a subset of the data). So you have two independent problems here:
the online training issue, which is purely technical, and can be addressed by SGD or libraries other than sklearn
concept drift, which is currently a hot topic and has no "just works" answer. There are many possibilities, hypotheses and proofs of concept, but there is no single, generally accepted way of dealing with this phenomenon; in fact many PhD dissertations in ML are currently based on this issue.
SGD for batch learning tasks normally has a decreasing learning rate and goes over the training set multiple times. So, for purely online learning, make sure learning_rate is set to 'constant' in sklearn.linear_model.SGDClassifier() and eta0=0.1 or any desired value. The process is therefore as follows:
import sklearn.linear_model
clf = sklearn.linear_model.SGDClassifier(learning_rate='constant', eta0=0.1, shuffle=False, n_iter=1)
# get x1, y1 as a new instance
clf.partial_fit(x1, y1, classes=all_possible_labels)  # classes is required on the first call
# get x2, y2
# update accuracy if needed
clf.partial_fit(x2, y2)
A way to scale SVM could be to split your large dataset into batches that can be safely consumed by an SVM algorithm, find the support vectors for each batch separately, and then build the final SVM model on a dataset consisting of all the support vectors found in all the batches.
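This is not an off-the-shelf library routine; a rough sketch of the idea with scikit-learn's SVC might look like:
import numpy as np
from sklearn.svm import SVC

def fit_svm_in_batches(X, y, batch_size=10000, **svm_params):
    # Fit an SVM per batch, keep only that batch's support vectors,
    # then fit a final SVM on the union of all support vectors.
    sv_X, sv_y = [], []
    for start in range(0, len(X), batch_size):
        Xb, yb = X[start:start + batch_size], y[start:start + batch_size]
        svm = SVC(**svm_params).fit(Xb, yb)
        sv_X.append(Xb[svm.support_])
        sv_y.append(yb[svm.support_])
    return SVC(**svm_params).fit(np.vstack(sv_X), np.concatenate(sv_y))

# Example on synthetic data:
rng = np.random.RandomState(0)
X_demo = rng.randn(1000, 5)
y_demo = (X_demo[:, 0] > 0).astype(int)
model = fit_svm_in_batches(X_demo, y_demo, batch_size=200)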
Updating to trends can be achieved by maintaining a time window each time you run your training pipeline. For example, if you train once a day and there is enough information in a month's historical data, build your training dataset from the historical data obtained in the most recent 30 days.
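A minimal sketch of such a sliding window with pandas (the file and column names are hypothetical):
import pandas as pd

# Hypothetical history with a 'timestamp' column plus feature/label columns.
history = pd.read_csv("articles.csv", parse_dates=["timestamp"])

# Keep only the most recent 30 days each time the pipeline runs.
cutoff = pd.Timestamp.now() - pd.Timedelta(days=30)
train_df = history[history["timestamp"] >= cutoff]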
If you are interested in online learning with concept drift, here is some previous work:
Learning under Concept Drift: an Overview
https://arxiv.org/pdf/1010.4784.pdf
The problem of concept drift: definitions and related work
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.58.9085&rep=rep1&type=pdf
A Survey on Concept Drift Adaptation
http://www.win.tue.nl/~mpechen/publications/pubs/Gama_ACMCS_AdaptationCD_accepted.pdf
MOA Concept Drift Active Learning Strategies for Streaming Data
http://videolectures.net/wapa2011_bifet_moa/
A Stream of Algorithms for Concept Drift
http://people.cs.georgetown.edu/~maloof/pubs/maloof.heilbronn12.handout.pdf
Mining Data Streams with Concept Drift
http://www.cs.put.poznan.pl/dbrzezinski/publications/ConceptDrift.pdf
Analyzing time series data with stream processing and machine learning
http://www.ibmbigdatahub.com/blog/analyzing-time-series-data-stream-processing-and-machine-learning
What algorithms exist for time series forecasting/regression?
What about using neural networks? (best docs about this topic?)
Are there Python libraries/code snippets that can help?
The classical approaches to time series regression are:
auto-regressive models (there are whole literatures about them)
Gaussian Processes
Fourier decomposition or similar to extract the periodic components of the signal (i.e., hidden oscillations in the data)
Other less common approaches that I know about are
Slow Feature Analysis, an algorithm that extracts the driving forces of a time series, e.g., the parameters behind a chaotic signal
Neural Network (NN) approaches, either using recurrent NNs (i.e., built to process time signals) or classical feed-forward NNs that receive as input part of the past data and try to predict a point in the future; the advantage of the latter is that recurrent NNs are known to have a problem with taking into account the distant past
In my opinion for financial data analysis it is important to obtain not only a best-guess extrapolation of the time series, but also a reliable confidence interval, as the resulting investment strategy could be very different depending on that. Probabilistic methods, like Gaussian Processes, give you that "for free", as they return a probability distribution over possible future values. With classical statistical methods you'll have to rely on bootstrapping techniques.
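As an illustration with a modern scikit-learn (GaussianProcessRegressor on a made-up noisy series; the kernel choice is just a reasonable default):
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Made-up noisy series observed at times t.
t = np.linspace(0, 10, 50)[:, None]
y = np.sin(t).ravel() + 0.1 * np.random.RandomState(0).randn(50)

gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(t, y)

# Extrapolate ahead and get a standard deviation "for free".
t_future = np.linspace(10, 12, 20)[:, None]
mean, std = gp.predict(t_future, return_std=True)
lower, upper = mean - 1.96 * std, mean + 1.96 * std  # ~95% interval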
There are many Python libraries that offer statistical and Machine Learning tools, here are the ones I'm most familiar with:
NumPy and SciPy are a must for scientific programming in Python
There is a Python interface to R, called RPy
statsmodels contains classical statistical modeling techniques, including autoregressive models; it works well with pandas, a popular data analysis package
scikits.learn, MDP, MLPy, Orange are collections of machine learning algorithms
PyMC, a Python module that implements Bayesian statistical models and fitting algorithms, including Markov chain Monte Carlo
PyBrain contains (among other things) implementations of feed-forward and recurrent neural networks
at the Gaussian Process site there is a list of GP software, including two Python implementations
mloss is a directory of open source machine learning software
I've no idea about python libraries, but there are good forecasting algorithms in R which are open source. See the forecast package for code and references for time series forecasting.
Two approaches
There are two ways to deal with temporally structured input for classification, regression, clustering, forecasting and related tasks:
Dedicated Time Series Model: the machine learning algorithm incorporates the time series directly. Such a model is like a black box and it can be hard to explain its behavior. Examples are autoregressive models.
Feature-based approach: here the time series are mapped to another, possibly lower-dimensional, representation. This means that a feature extraction algorithm calculates characteristics such as the average or maximal value of the time series. The features are then passed as a feature matrix to a "normal" machine learning algorithm such as a neural network, random forest or support vector machine. This approach has the advantage of better explainability of the results. Further, it allows us to use the well-developed theory of supervised machine learning.
tsfresh calculates a huge number of features
The Python package tsfresh calculates a huge number of such features from a pandas.DataFrame containing the time series. You can find its documentation at http://tsfresh.readthedocs.io.
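A minimal usage sketch (a tiny made-up data frame in the long format tsfresh expects, with one row per measurement):
import pandas as pd
from tsfresh import extract_features

# Two short time series, identified by 'id', ordered by 'time'.
df = pd.DataFrame({
    "id":    [1, 1, 1, 2, 2, 2],
    "time":  [0, 1, 2, 0, 1, 2],
    "value": [1.0, 2.0, 3.0, 5.0, 4.0, 3.0],
})

# One row of features per time series id (mean, maximum, FFT coefficients, ...).
features = extract_features(df, column_id="id", column_sort="time")
print(features.shape)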
Disclaimer: I am one of the authors of tsfresh.
Speaking only about the algorithms behind them, I recently used double exponential smoothing in a project and it did well at forecasting new values when there is a trend in the data.
The implementation is pretty trivial, but maybe the algorithm is not sophisticated enough for your case.
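For reference, a plain-Python sketch of double (Holt) exponential smoothing:
def double_exponential_smoothing(series, alpha, beta, n_forecast=1):
    # alpha smooths the level, beta smooths the trend (both in [0, 1]).
    level, trend = series[0], series[1] - series[0]
    for value in series[1:]:
        last_level = level
        level = alpha * value + (1 - alpha) * (level + trend)
        trend = beta * (level - last_level) + (1 - beta) * trend
    # Forecast h steps ahead as the last level plus h times the trend.
    return [level + (h + 1) * trend for h in range(n_forecast)]

print(double_exponential_smoothing([3, 4, 6, 7, 9, 10], alpha=0.8, beta=0.5, n_forecast=3))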
Did you try autocorrelation for finding periodic patterns in time series?
You can do that with the numpy.correlate function.
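For example (a made-up signal with a period of 20 samples plus noise):
import numpy as np
import matplotlib.pyplot as plt

# Made-up signal: a sine with period 20 plus noise.
rng = np.random.RandomState(0)
x = np.sin(2 * np.pi * np.arange(200) / 20) + 0.3 * rng.randn(200)

x = x - x.mean()
acf = np.correlate(x, x, mode="full")[len(x) - 1:]  # lags 0, 1, 2, ...
acf /= acf[0]                                       # normalize to 1 at lag 0

# Peaks at lags 20, 40, ... reveal the periodic pattern.
plt.plot(acf[:60])
plt.xlabel("lag")
plt.ylabel("autocorrelation")
plt.show()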
Group method of data handling is widely used to forecast financial data.
If you want to understand time series forecasting using Python, the link below is very helpful.
https://github.com/ManojKumarMaruthi/Time-Series-Forecasting