In my project, we are supposed to use an SVM-based algorithm. To get a basic idea of how to implement an SVM, we are trying it on a toy problem: the input is an array of 1000 integers arranged in a repeating pattern, 95 values in the range 0-5 followed by 5 values around 10,000, then 95 values in the range 0-5 again, then 5 around 10,000, and so on. The algorithm should predict the next 100 integers (the 1001st through 1100th): the first 95 around 0-5 and the last 5 around 10,000.
How would we implement this? Our preferred programming language is Python. Are there SVM modules, such as libsvm, that would facilitate this?
I know this might be a basic question, but any help would be appreciated a lot!
Here are some resources on AI (SVM specifically) from the Python wiki:
Milk - Milk is a machine learning toolkit in Python. Its focus is on supervised classification with several classifiers available: SVMs (based on libsvm), k-NN, random forests, decision trees. It also performs feature selection. These classifiers can be combined in many ways to form different classification systems.
LibSVM - LIBSVM is an integrated software for support vector classification (C-SVC, nu-SVC), regression (epsilon-SVR, nu-SVR) and distribution estimation (one-class SVM). It supports multi-class classification. A Python interface is available by default.
Shogun - The machine learning toolbox's focus is on large-scale kernel methods and especially on Support Vector Machines (SVM). It provides a generic SVM object interfacing to several different SVM implementations, among them the state-of-the-art OCAS, Liblinear, LibSVM, SVMLight, SVMLin and GPDT. Each of the SVMs can be combined with a variety of kernels. The toolbox not only provides efficient implementations of the most common kernels, like the linear, polynomial, Gaussian and sigmoid kernels, but also comes with a number of recent string kernels. SHOGUN is implemented in C++ and interfaces to Matlab(tm), R, Octave and Python, and is released as open source machine learning software.
I'd go with the SVM implementation in scikit-learn. Other options include SVMlight and LaSVM.
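As a minimal sketch of what this looks like with scikit-learn, assuming you recast the question's 1000-integer pattern as classifying the position within each repeating block of 100 (the data below is synthetic):

import numpy as np
from sklearn.svm import SVC

# Toy setup mirroring the question: within each block of 100 positions,
# indices 0-94 hold small values (0-5) and indices 95-99 hold large ones (~10,000).
positions = np.arange(1000).reshape(-1, 1) % 100  # feature: position within its block
labels = (positions.ravel() >= 95).astype(int)    # 1 = "large value" slot, 0 = "small value" slot

clf = SVC(kernel="rbf")
clf.fit(positions, labels)

# Predict the pattern for the next 100 positions (the 1001st through 1100th)
next_positions = np.arange(1000, 1100).reshape(-1, 1) % 100
print(clf.predict(next_positions))  # expected: 95 zeros followed by 5 ones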
Are these libraries fairly interchangeable?
Looking here, https://stackshare.io/stackups/keras-vs-pytorch-vs-scikit-learn, it seems the major difference is the underlying framework (at least for PyTorch).
Yes, there is a major difference.
scikit-learn is a general-purpose machine learning library built on top of NumPy. It features many machine learning algorithms, such as support vector machines and random forests, as well as many utilities for general pre- and post-processing of data. It is not a neural network framework.
PyTorch is a deep learning framework, consisting of
A vectorized math library similar to NumPy, but with GPU support and a lot of neural network related operations (such as softmax or various kinds of activations)
Autograd - an algorithm which can automatically calculate gradients of your functions, defined in terms of the basic operations
Gradient-based optimization routines for large-scale problems, tailored to neural network training
Neural-network related utility functions
Keras is a higher-level deep learning framework that abstracts many details away, making code simpler and more concise than in PyTorch or TensorFlow, at the cost of limited hackability. It abstracts away the computation backend, which can be TensorFlow, Theano or CNTK. PyTorch is not a supported backend, but conceptually you can think of Keras as a simplified, streamlined subset of the frameworks above.
In short, if you are going with "classic", non-neural algorithms, neither PyTorch nor Keras will be useful for you. If you're doing deep learning, scikit-learn may still be useful for its utilities; beyond that you will need an actual deep learning framework, where you can choose between Keras and PyTorch, though you're unlikely to use both at the same time. This is very subjective, but in my view: if you're working on a novel algorithm, you're more likely to go with PyTorch (or TensorFlow or some other lower-level framework) for its flexibility; if you're adapting a known and tested algorithm to a new problem setting, you may want to go with Keras for its greater simplicity and lower barrier to entry.
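To make the division of labor concrete, here is a toy sketch of the same binary classifier in both APIs (illustrative only; assumes scikit-learn and a recent TensorFlow with tf.keras are installed):

import numpy as np
import tensorflow as tf
from sklearn.linear_model import LogisticRegression

X = np.random.rand(200, 4).astype("float32")
y = (X.sum(axis=1) > 2).astype(int)

# scikit-learn: one call, the training procedure is handled for you
clf = LogisticRegression().fit(X, y)

# Keras: the same kind of model, but the optimizer, loss and
# number of epochs are configured explicitly
model = tf.keras.Sequential([tf.keras.layers.Dense(1, activation="sigmoid")])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(X, y, epochs=10, verbose=0)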
I have started learning TensorFlow recently, and I am wondering whether it is worth using for simple optimization problems (least squares, maximum likelihood estimation, ...) instead of more traditional libraries (scikit-learn, statsmodels)?
I have implemented a basic AR model estimator in TensorFlow using MLE and the AdamOptimizer, and the results are not convincing in terms of either accuracy or computation speed.
What do you think?
This is somewhat opinion-based, but TensorFlow and similar frameworks such as PyTorch are useful when you want to optimize an arbitrary, parameter-rich non-linear function (e.g., a deep neural network). For a 'standard' statistical model, I would use code that is already tailored to it instead of reinventing the wheel. This is especially true when a closed-form solution exists (as in linear least squares): why wade into the murky waters of local optimization when you don't have to? Another advantage of using existing statistical libraries is that they usually provide measures of uncertainty about your point estimates.
I see one potential case in which you might want to use TensorFlow for a simple linear model: when the number of variables is so big that the model can't be estimated using closed-form approaches. Then gradient descent based optimization makes sense, and TensorFlow is a viable tool for that.
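To make the closed-form point concrete, here is a minimal NumPy sketch (synthetic data) comparing an exact least-squares solve with gradient descent on the same objective; the exact route is one line and has no learning rate to tune:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=1000)

# Closed form: solves min ||Xw - y||^2 exactly via a stable least-squares routine
w_exact = np.linalg.lstsq(X, y, rcond=None)[0]

# Gradient descent on the same objective: needs a step size and an iteration budget
w = np.zeros(3)
lr = 0.01
for _ in range(500):
    grad = 2.0 / len(y) * X.T @ (X @ w - y)
    w -= lr * grad

print(w_exact, w)  # both should land close to w_true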
I was wondering whether, and how, it is possible to use a Python generator as the data input to a scikit-learn classifier's .fit() method. Given the huge amount of data involved, this seems to make sense to me.
In particular, I am about to implement a random forest approach.
The answer is "no". To do out of core learning with random forests, you should
Split your data into reasonably-sized batches (restricted by the amount of RAM you have; bigger is better);
train separate random forests;
append all the underlying trees together in the estimators_ member of one of the trees (untested):
# Merge the trees of every forest into the first one
for i in range(1, len(forests)):
    forests[0].estimators_.extend(forests[i].estimators_)
forests[0].n_estimators = len(forests[0].estimators_)
(Yes, this is hacky, but no better solution to this problem has been found yet. Note that with very large datasets it might pay to just sample a number of training examples that fits in the RAM of a big machine instead of training on all of it. Another option is to switch to linear models trained with SGD: those implement a partial_fit method, but they are obviously limited in the kinds of functions they can learn.)
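For reference, a self-contained sketch of that batch-and-merge workflow (the load_batch helper is a stand-in for your own chunked data loading; the merge assumes every batch contains examples of every class):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def load_batch(i):
    # Stand-in for reading the i-th RAM-sized chunk from disk; synthetic here.
    rng = np.random.default_rng(i)
    X = rng.normal(size=(1000, 5))
    y = (X[:, 0] > 0).astype(int)
    return X, y

forests = []
for i in range(10):
    X, y = load_batch(i)
    forests.append(RandomForestClassifier(n_estimators=50).fit(X, y))

# Merge all trees into the first forest (the same hack as above)
merged = forests[0]
for f in forests[1:]:
    merged.estimators_.extend(f.estimators_)
merged.n_estimators = len(merged.estimators_)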
The short answer is "No, you can't". Classical Random Forest classifier is not an incremental or online classifier, so you can't discard training data while learning, and have to provide all the dataset at once.
Due to popularity of RF in machine learning (not least because of the good prediction results for some interesting cases), there are some attempts to implement online variation of Random Forest, but to my knowledge those are not yet implemented in any python ML package.
See Amir Saffari's page for such an approach (not Python).
Possible Duplicate:
Text Classification into Categories
I am currently working on a solution to determine the type of food served by each of the roughly 10k restaurants in a database, based on their descriptions. I'm using lists of keywords to decide which kind of food is being served.
I have read a little about machine learning, but I have no practical experience with it at all. Can anyone explain whether, and why, it would be a better solution to a simple problem like this? I find accuracy more important than performance!
simplified example:
["China", "Chinese", "Rice", "Noodles", "Soybeans"]
["Belgium", "Belgian", "Fries", "Waffles", "Waterzooi"]
a possible description could be:
"Hong's Garden Restaurant offers savory, reasonably priced Chinese to our customers. If you find that you have a sudden craving for
rice, noodles or soybeans at 8 o’clock on a Saturday evening, don’t worry! We’re open seven days a week and offer carryout service. You can get fries here as well!"
You are indeed describing a classification problem, which can be solved with machine learning.
In this problem, your features are the words in the description. You should use the Bag of Words model, which says that the words that appear in the text, and the number of occurrences of each word, are what matter to the classification process, not their order.
To solve your problem, here are the steps you should follow (a Python sketch of them comes right after this list):
Create a feature extractor that, given the description of a restaurant, returns its "features" under the Bag of Words model explained above (such a feature vector is denoted an example in the literature).
Manually label a set of examples, each with its desired class (Chinese, Belgian, junk food, ...).
Feed your labeled examples into a learning algorithm; it will generate a classifier. From personal experience, SVM usually gives the best results, but there are other choices such as Naive Bayes, neural networks and decision trees (usually C4.5 is used); each has its own advantages.
When a new, unlabeled example (restaurant) comes in, extract its features and feed it to your classifier; it will tell you what it thinks the restaurant is (and usually with what probability it is correct).
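The library suggestions at the end of this answer are mostly Java; as a hypothetical Python equivalent of the steps above, here is a minimal bag-of-words + SVM sketch with scikit-learn (the tiny training set is made up):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Step 2: a few manually labeled descriptions (made-up toy data)
descriptions = [
    "rice noodles soybeans and more rice",
    "authentic chinese noodles and rice dishes",
    "fries waffles and waterzooi every day",
    "belgian fries and fresh waffles",
]
labels = ["chinese", "chinese", "belgian", "belgian"]

# Steps 1 and 3: bag-of-words feature extraction chained into an SVM
clf = make_pipeline(CountVectorizer(), LinearSVC())
clf.fit(descriptions, labels)

# Step 4: classify a new, unlabeled description
print(clf.predict(["savory reasonably priced chinese rice and noodles"]))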
Evaluation:
Evaluation of your algorithm can be done with cross-validation, or by separating out a test set from your labeled examples that will be used only for evaluating how accurate the algorithm is.
Optimizations:
From personal experience - here are some optimizations I found helpful for the feature extraction:
Stemming and eliminating stop words usually helps a lot.
Using bi-grams tends to improve accuracy (though it increases the feature space significantly).
Some classifiers struggle with a large feature space (SVM is not among them); there are ways to deal with this, such as reducing the dimensionality of your features. PCA is one thing that can help with it. Genetic algorithms are also (empirically) pretty good at subset selection.
Libraries:
Unfortunately, I am not fluent enough with Python, but here are some libraries that might be helpful:
Lucene might help you a lot with the text analysis; for example, stemming can be done with its EnglishAnalyzer. There is a Python port of Lucene called PyLucene, which I believe might help you out.
Weka is an open source library that implements a lot of useful things for Machine Learning - many classifiers and feature selectors included.
Libsvm is a library that implements the SVM algorithm.
What algorithms exist for time series forecasting/regression?
What about using neural networks? (What are the best docs on this topic?)
Are there Python libraries/code snippets that can help?
The classical approaches to time series regression are:
auto-regressive models (there is a whole literature about them)
Gaussian Processes
Fourier decomposition or similar to extract the periodic components of the signal (i.e., hidden oscillations in the data)
Other, less common approaches that I know about are:
Slow Feature Analysis, an algorithm that extracts the driving forces of a time series, e.g., the parameters behind a chaotic signal
Neural network (NN) approaches, either using recurrent NNs (i.e., networks built to process time signals) or classical feed-forward NNs that receive part of the past data as input and try to predict a point in the future; an advantage of the latter is that recurrent NNs are known to have trouble taking the distant past into account
In my opinion for financial data analysis it is important to obtain not only a best-guess extrapolation of the time series, but also a reliable confidence interval, as the resulting investment strategy could be very different depending on that. Probabilistic methods, like Gaussian Processes, give you that "for free", as they return a probability distribution over possible future values. With classical statistical methods you'll have to rely on bootstrapping techniques.
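As an illustration of getting that distribution "for free", here is a sketch using scikit-learn's modern GaussianProcessRegressor (the kernel choice here is arbitrary, and the data is synthetic):

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
t = np.linspace(0, 10, 50).reshape(-1, 1)
y = np.sin(t).ravel() + 0.1 * rng.normal(size=50)

gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel()).fit(t, y)

# Extrapolate past the observed range: the GP returns a standard deviation
# per point, i.e., a confidence band rather than just a point forecast
t_future = np.linspace(10, 12, 20).reshape(-1, 1)
mean, std = gp.predict(t_future, return_std=True)
print(mean[:3], std[:3])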
There are many Python libraries that offer statistical and machine learning tools; here are the ones I'm most familiar with:
NumPy and SciPy are a must for scientific programming in Python
There is a Python interface to R, called RPy
statsmodels contains classical statistical modeling techniques, including autoregressive models (see the sketch after this list); it works well with Pandas, a popular data analysis package
scikits.learn, MDP, MLPy, Orange are collections of machine learning algorithms
PyMC, a Python module that implements Bayesian statistical models and fitting algorithms, including Markov chain Monte Carlo
PyBrain contains (among other things) implementations of feed-forward and recurrent neural networks
at the Gaussian Process site there is a list of GP software, including two Python implementations
mloss is a directory of open source machine learning software
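As a minimal sketch of the autoregressive route via statsmodels (assumes a recent version, where the AR model lives in AutoReg; the series below is synthetic):

import numpy as np
from statsmodels.tsa.ar_model import AutoReg

# Synthetic AR(2)-like series as stand-in data
rng = np.random.default_rng(0)
series = np.zeros(200)
for t in range(2, 200):
    series[t] = 0.6 * series[t-1] - 0.2 * series[t-2] + rng.normal(scale=0.1)

model = AutoReg(series, lags=2).fit()
# Predicting past the end of the sample yields an out-of-sample forecast
forecast = model.predict(start=len(series), end=len(series) + 9)
print(forecast)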
I've no idea about Python libraries, but there are good forecasting algorithms in R which are open source. See the forecast package for code and references for time series forecasting.
Two approaches
There are two ways to deal with temporally structured input for classification, regression, clustering, forecasting and related tasks:
Dedicated time series model: the machine learning algorithm incorporates the time series directly. Such a model is like a black box, and it can be hard to explain its behavior. Autoregressive models are an example.
Feature-based approach: here the time series is mapped to another, possibly lower-dimensional, representation. A feature extraction algorithm calculates characteristics such as the average or the maximal value of the time series, and these features are then passed as a feature matrix to a "normal" machine learning algorithm such as a neural network, random forest or support vector machine. This approach has the advantage of better explainability of the results; further, it lets us draw on the well-developed theory of supervised machine learning.
tsfresh calculates a huge number of features
The Python package tsfresh calculates a huge number of such features from a pandas.DataFrame containing the time series. You can find its documentation at http://tsfresh.readthedocs.io.
Disclaimer: I am one of the authors of tsfresh.
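A minimal usage sketch of tsfresh (assumes tsfresh and pandas are installed; the long-format frame with id/time columns is the layout tsfresh expects):

import pandas as pd
from tsfresh import extract_features

# Two toy series in long format: one row per (series id, time step)
df = pd.DataFrame({
    "id":    [1, 1, 1, 2, 2, 2],
    "time":  [0, 1, 2, 0, 1, 2],
    "value": [1.0, 2.0, 3.0, 5.0, 4.0, 3.0],
})

# Returns one row of calculated features per series id
features = extract_features(df, column_id="id", column_sort="time")
print(features.shape)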
Speaking only about the algorithms behind them: I recently used double exponential smoothing in a project, and it did well at forecasting new values when there was a trend in the data.
The implementation is pretty trivial (a sketch follows below), but maybe the algorithm is not sufficiently elaborate for your case.
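Here is a minimal sketch of the textbook (Holt) double exponential smoothing recurrences; the alpha and beta values are illustrative, not tuned:

def double_exponential_smoothing(series, alpha, beta, n_forecast):
    # Initialize level and trend from the first two observations
    level, trend = series[0], series[1] - series[0]
    for value in series[1:]:
        last_level = level
        level = alpha * value + (1 - alpha) * (level + trend)
        trend = beta * (level - last_level) + (1 - beta) * trend
    # Forecast by extrapolating the final level and trend
    return [level + h * trend for h in range(1, n_forecast + 1)]

print(double_exponential_smoothing([3, 5, 8, 12, 17], alpha=0.8, beta=0.5, n_forecast=3))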
Have you tried autocorrelation for finding periodic patterns in a time series?
You can do that with the numpy.correlate function.
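A small sketch of that idea (a noisy sine with period 50; the dominant non-trivial peak of the autocorrelation recovers the period):

import numpy as np

rng = np.random.default_rng(0)
t = np.arange(500)
x = np.sin(2 * np.pi * t / 50) + 0.2 * rng.normal(size=t.size)

x = x - x.mean()                       # remove the mean before correlating
acf = np.correlate(x, x, mode="full")  # correlation of x with itself at all lags
acf = acf[acf.size // 2:]              # keep non-negative lags only

# Skip the short lags (where the correlation has not yet decayed) and look
# for the dominant peak; it sits near the period of the signal (about 50)
print(np.argmax(acf[25:200]) + 25)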
Group method of data handling (GMDH) is widely used to forecast financial data.
If you want to understand time series forecasting using Python, the link below may be helpful:
https://github.com/ManojKumarMaruthi/Time-Series-Forecasting