Using Python generators in scikit-learn [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 9 years ago.
I was wondering whether, and how, it is possible to use a Python generator as data input to the .fit() method of a scikit-learn classifier. Given the huge amount of data I have, this seems to make sense to me.
In particular I am about to implement a random forest approach.
Regards
K

The answer is "no". To do out-of-core learning with random forests, you should
split your data into reasonably sized batches (restricted by the amount of RAM you have; bigger is better);
train a separate random forest on each batch;
append all the underlying trees together in the estimators_ member of one of the forests (untested):
    for forest in forests[1:]:
        # Merge each forest's trees into the first forest.
        forests[0].estimators_.extend(forest.estimators_)
    forests[0].n_estimators = len(forests[0].estimators_)  # keep the tree count in sync
(Yes, this is hacky, but no solution to this problem has been found yet. Note that with very large datasets, it might pay to just sample a number of training examples that fits in the RAM of a big machine instead of training on all of it. Another option is to switch to linear models trained with SGD. Those implement a partial_fit method, but obviously they're limited in the kind of functions they can learn.)
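To make the partial_fit option concrete, here is a minimal sketch of feeding a generator to SGDClassifier one batch at a time. The batch_generator function and its synthetic linear data are made up for illustration; in practice it would read chunks from disk or a database. The one real requirement shown is that partial_fit needs the full list of class labels on the first call.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier


def batch_generator(n_batches, batch_size=200, n_features=5, seed=0):
    """Hypothetical data source: yields (X, y) batches one at a time."""
    rng = np.random.default_rng(seed)
    true_w = np.arange(1, n_features + 1, dtype=float)  # made-up linear rule
    for _ in range(n_batches):
        X = rng.normal(size=(batch_size, n_features))
        y = (X @ true_w > 0).astype(int)
        yield X, y


clf = SGDClassifier(random_state=0)
all_classes = np.array([0, 1])  # partial_fit must know every label up front

for X_batch, y_batch in batch_generator(n_batches=50):
    # Each batch updates the model and can then be discarded.
    clf.partial_fit(X_batch, y_batch, classes=all_classes)

# Evaluate on a fresh batch drawn with a different seed.
X_test, y_test = next(batch_generator(n_batches=1, seed=123))
accuracy = clf.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.3f}")
```

Only one batch is held in memory at any time, which is the point of going out of core. The generator is consumed by your own loop, not by .fit() itself.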

The short answer is "No, you can't". The classical random forest classifier is not an incremental or online classifier, so you can't discard training data while learning and have to provide the whole dataset at once.
Due to the popularity of RF in machine learning (not least because of its good prediction results in some interesting cases), there have been some attempts to implement an online variant of random forests, but to my knowledge those are not yet implemented in any Python ML package.
See Amir Saffari's page for such an approach (not Python).

Related

Audio recognition and fingerprint using sklearn & librosa [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed last year.
I want to create a model that can predict who is speaking, even when the speakers say different words.
For this I am trying to use the following features:
MFCC
Mel spectrogram
Tempo
Chroma STFT
Spectral centroid
Spectral bandwidth
For training I am using RandomForestRegressor.
Is it possible to create a model like that?
For the sound processing and feature extraction part, librosa is definitely going to provide you all you need.
For the machine learning part however, speaker identification (also called "voice recognition") is a relatively complex task. You probably will get more success using techniques from deep learning. You can certainly try to use random forests if you like, but you'll probably get a lower accuracy and will have to spend more time doing feature engineering. In fact, it will be a good exercise for you to compare the results you can get with the various techniques.
For an example tutorial on speaker identification using Keras, see e.g. this article.
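If you do want to try the random forest route first, a rough sketch could look like the following. It assumes each clip has already been turned into a per-frame feature matrix of shape (n_coefficients, n_frames), which is the layout librosa's feature functions return; here that matrix is faked with random numbers so the example runs without audio files. The fake_clip and summarize_frames helpers and the three speakers are hypothetical. Note that it uses RandomForestClassifier, not RandomForestRegressor, since predicting a speaker is a classification task.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)


def summarize_frames(frame_features):
    """Collapse a (n_coefficients, n_frames) matrix into one fixed-length vector."""
    return np.concatenate([frame_features.mean(axis=1), frame_features.std(axis=1)])


def fake_clip(speaker_id):
    # Stand-in for the per-frame features (e.g. 13 MFCCs) of one recording.
    n_frames = rng.integers(80, 120)  # clips differ in length
    return rng.normal(loc=speaker_id, scale=1.0, size=(13, n_frames))


labels = np.repeat([0, 1, 2], 40)  # three hypothetical speakers, 40 clips each
X = np.stack([summarize_frames(fake_clip(s)) for s in labels])

train = np.arange(len(labels)) % 4 != 0  # hold out every fourth clip
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X[train], labels[train])
accuracy = clf.score(X[~train], labels[~train])
print(f"held-out accuracy on synthetic clips: {accuracy:.2f}")
```

The summarizing step matters because clips have different lengths, while a random forest needs one fixed-length row per clip. Real voices will be far harder to separate than this synthetic data.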

Do we expect baseline (all features) and selected features to perform the same with decision trees? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 1 year ago.
I'm using sklearn's decision tree for a binary class problem. However it turns out that after optimizing everything (optimizing hyper parameters and using the optimal number of selected features), the best I can do is get an accuracy and f1-score that's as good as baseline (no feature selection and use all features).
Sure, now it's less messy (fewer features), and the code runs faster. But is this expected? Or is the point of feature selection to improve the performance metrics of the classifier?
That's expected. Feature selection mostly gives you runtime and simplicity benefits, and it might help a little against overfitting where that is an issue. It's not really supposed to improve the classifier's metrics, as you are essentially trying to solve the same problem with less information in your hands.
It doesn't mean you shouldn't do it though. If you can achieve the same performance with fewer features - use fewer features :)
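A quick way to check this on your own data is to cross-validate both variants side by side. The sketch below uses a synthetic dataset from make_classification and SelectKBest with k=5 as the selection step; both are assumptions standing in for your data and your selection method. Putting the selector inside a pipeline keeps the selection from seeing the validation folds.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem: 30 features, only 5 of them informative.
X, y = make_classification(
    n_samples=500, n_features=30, n_informative=5, n_redundant=0, random_state=0
)

# Baseline: decision tree on all features.
baseline = DecisionTreeClassifier(max_depth=5, random_state=0)

# Same tree, but trained on the 5 highest-scoring features only.
selected = make_pipeline(
    SelectKBest(f_classif, k=5),
    DecisionTreeClassifier(max_depth=5, random_state=0),
)

baseline_f1 = cross_val_score(baseline, X, y, cv=5, scoring="f1").mean()
selected_f1 = cross_val_score(selected, X, y, cv=5, scoring="f1").mean()
print(f"all features: F1 = {baseline_f1:.3f}")
print(f"5 features:   F1 = {selected_f1:.3f}")
```

If the two scores come out close, that matches what you observed, and the smaller model is the one to keep.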

Does Spark improve performance with huge dataframe and machine learning algorithm that is not available in MLlib? [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 2 years ago.
I’m training a machine learning model in Python 3, but it’s taking a long time. I have a very large dataframe and the algorithm I’m using isn’t available in Spark MLlib. Is there any performance benefit in terms of training time from loading my dataframe into Spark and using a non-MLlib algorithm?
In terms of manipulating the dataframe, I understand manipulating it will be faster, but if the algorithm isn’t distributed, I am not sure if it would speed up training. I’m new to Spark and am not sure if I’m understanding it correctly.
Yes, Spark could help with training a model, even if the model isn't part of the Spark standard library. It all depends on whether you leverage the power of cluster computing when training the model. Suppose you have a 20-node i3.xlarge cluster (30.5 GB of RAM per node) and have all nodes crunching data in parallel to train your model. That's essentially a 610 GB supercomputer at your fingertips.
If you don't structure the code properly, you might accidentally perform all the computations on the driver node and only use one of the nodes in your cluster, leaving the other ones idle.
Spark is also powerful for running models on huge datasets. Suppose you have a Python model that takes a bunch of inputs and returns an output. Spark is a great way to run this model on, let's say, 50 billion rows of data.
Not sure why you're getting downvoted, this is an excellent question in my opinion.

Deep learning with data from simulations [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
While reading the great book by F. Chollet, I'm experimenting with Keras / Tensorflow, on a simple Sequential model that I train on simulated images, which come from a physical analytical model.
Having full control of the simulations, I wrote a generator which produces an infinite stream of data and label batches, which I use with fit_generator in Keras. The data so generated are never identical, plus I can add some random noise to each image.
Now I'm wondering: is it a problem if the model never sees the same input data from one epoch to the next?
Can I assume my problems in getting the loss down are not due to the fact that the data are "infinite" (so I only have to concentrate on hyper parameters tuning)?
Please feel free to share any advice you have for dealing with DL on simulated data.
A well-trained network will pick up on patterns in the data, prioritizing new data over old. If your data comes from a constant distribution this doesn't matter, but if that distribution is changing over time the network should adapt (slowly) to the more recent distribution.
The fact that the data is never identical does not matter. Most networks are trained with some form of data augmentation (e.g. in image processing it is common for images to be randomly cropped, rotated, resized, and color-shifted, so no two examples are ever identical even if they come from the same base image).
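For reference, a generator of the kind described in the question might look like this minimal sketch. The "physical model" here (a Gaussian blob whose width is the label) and all parameter names are invented for illustration; only the overall shape, an endless loop that yields fresh noisy batches, reflects the setup being discussed.

```python
import numpy as np


def simulated_batches(batch_size=32, image_size=16, noise_std=0.05, seed=None):
    """Yield (images, labels) batches forever; every batch is freshly simulated."""
    rng = np.random.default_rng(seed)
    yy, xx = np.mgrid[0:image_size, 0:image_size]
    while True:
        # Hypothetical analytical model: a Gaussian blob whose width is the label.
        widths = rng.uniform(1.0, 4.0, size=batch_size)
        cx = rng.uniform(4, image_size - 4, size=(batch_size, 1, 1))
        cy = rng.uniform(4, image_size - 4, size=(batch_size, 1, 1))
        r2 = (xx - cx) ** 2 + (yy - cy) ** 2
        images = np.exp(-r2 / (2 * widths[:, None, None] ** 2))
        # Random noise plays the role of augmentation.
        images += rng.normal(scale=noise_std, size=images.shape)
        # Add a trailing channel axis, as image models usually expect.
        yield images[..., None].astype("float32"), widths.astype("float32")


gen = simulated_batches(seed=0)
images, labels = next(gen)
images2, _ = next(gen)
print(images.shape, labels.shape)
```

With an endless stream like this, an "epoch" is just a bookkeeping unit: you choose how many batches count as one (steps_per_epoch in Keras). It is still worth generating one fixed validation set up front, so the loss curve is measured on the same data every time.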

Is Tensorflow worth using for simple optimization problems? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 4 years ago.
I have started learning Tensorflow recently and I am wondering if it is worth using for simple optimization problems (least squares, maximum likelihood estimation, ...) instead of more traditional libraries (scikit-learn, statsmodels).
I have implemented a basic AR model estimator using Tensorflow with MLE and the AdamOptimizer, and the results are not convincing in terms of either accuracy or computation speed.
What do you think?
This is somewhat opinion-based, but Tensorflow and similar frameworks such as PyTorch are useful when you want to optimize an arbitrary, parameter-rich non-linear function (e.g., a deep neural network). For a 'standard' statistical model, I would use code that is already tailored to it instead of reinventing the wheel. This is especially true when there are closed-form solutions (as in linear least squares): why go into the murky waters of local optimization when you don't have to? Another advantage of using existing statistical libraries is that they usually provide you with measures of uncertainty about your point estimates.
I see one potential case in which you might want to use Tensorflow for a simple linear model: when the number of variables is so big that the model can't be estimated using closed-form approaches. Then gradient-descent-based optimization makes sense, and Tensorflow is a viable tool for that.
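To illustrate the contrast on a small problem, the sketch below solves the same linear least-squares fit twice with NumPy: once in closed form and once with plain gradient descent. The data, the learning rate and the iteration count are arbitrary choices for the example, and the hand-written loop stands in for what an optimizer such as Adam would do inside Tensorflow.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 200, 3
X = rng.normal(size=(n_samples, n_features))
true_w = np.array([2.0, -1.0, 0.5])  # made-up ground truth
y = X @ true_w + rng.normal(scale=0.1, size=n_samples)

# Closed form: one call, no step size or iteration count to tune.
w_closed, *_ = np.linalg.lstsq(X, y, rcond=None)

# Iterative: plain gradient descent on the mean squared error.
w_gd = np.zeros(n_features)
learning_rate = 0.1
for _ in range(500):
    gradient = 2.0 / n_samples * X.T @ (X @ w_gd - y)
    w_gd -= learning_rate * gradient

print("closed form:     ", w_closed)
print("gradient descent:", w_gd)
```

Both routes land on the same coefficients here, but the iterative one needed a learning rate and an iteration budget to get there. That extra tuning only pays off when the problem is too large for the direct solve.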
