I am trying to write a simple random forest program: take two sequences, train on them, predict, and plot the final random forest curve.
But I can't work out what kind of sequences I should use, or how to plot the random forest result on a graph as we do in R.
This is what I have tried so far:
import numpy as np
from pylab import *
test=np.random.rand(1000,10)
print (test)
train=np.random.rand(1000,5)
print (train)
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=100,n_jobs=10)
rfc.fit(test, train)
Please have a look at the code. It would be a great help if you could correct it and also show me how to plot the random forest result.
I am expecting your kind reply as soon as possible.
In R, I did this:
# simulate the data
train=rnorm(1,1000,.2)
predict=rnorm(1100,1200,.5)
df=data.frame(train, predict)
# run the randomForest implementation
library(randomForest)
rf1 <- randomForest(predict~., data=df, mtry=2, ntree=500, importance=TRUE)
importance(rf1,type=1)
# run the party implementation
library(party)
cf1 <- cforest(predict~.,data=df,control=cforest_unbiased(mtry=2,ntree=50))
varimp(cf1)
varimp(cf1,conditional=TRUE)
# plots
plot (rf1, log = "y")
What is the expected meaning of the train and test variable?
The documentation of RandomForestClassifier.fit says that, for a classifier, you need to pass class labels as the second argument (named y in the documentation). These can be either integer values (one integer per possible class) or a list of string labels.
Also, fit is expected to be called with training data only (training set input features and training set labels), so passing a variable named test is really confusing.
Please start by following one of the tutorials of scikit-learn to understand how to train a classifier with that library:
http://scikit-learn.org/stable/documentation.html
then read the documentation of the random forests in particular:
http://scikit-learn.org/stable/modules/ensemble.html#forests-of-randomized-trees
If you want to compute the variable importances, read this section in particular:
http://scikit-learn.org/stable/modules/ensemble.html#feature-importance-evaluation
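To make the point about fit concrete, here is a minimal sketch of the call pattern. The synthetic data from make_classification and the names X_train, y_train, X_test and y_test are assumptions for illustration only, not your data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 samples, 10 features, integer class labels.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rfc = RandomForestClassifier(n_estimators=100, random_state=0)
rfc.fit(X_train, y_train)  # training features first, training labels second

accuracy = rfc.score(X_test, y_test)     # mean accuracy on held-out data
importances = rfc.feature_importances_   # one value per input feature
ranking = np.argsort(importances)[::-1]  # feature indices, most important first
print("test accuracy:", accuracy)
print("features by decreasing importance:", ranking)
```

The importances array is what you would typically plot, for example as a bar chart with one bar per feature.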
Usually people use scikit-learn to train a model this way:
from sklearn.ensemble import GradientBoostingClassifier as gbc
clf = gbc()
clf.fit(X_train, y_train)
predicted = clf.predict(X_test)
It works fine as long as the user's memory is large enough to hold the entire dataset. My dilemma is exactly this: the dataset is too big for my memory. My current solution is to enlarge the virtual memory of my machine, and I have already made the system extremely slow by using too much of it. So I started to wonder whether it is possible to feed the fit() method with samples in batches like this (the answer is no; please keep reading and stop reminding me that the answer is no):
clf = gbc()
for i in range(X_train.shape[0]):
    clf.fit(X_train[i], y_train[i])
so that I can read the training set from the hard drive only when needed. I read sklearn's manual and it seems to me that it does not support this:
Calling fit() more than once will overwrite what was learned by any previous fit()
So, is this possible?
This does not work in scikit-learn, as explained in the comment section as well as in the documentation. However, you can use river (a Python package for online/streaming machine learning). This package should be well suited to your problem.
Below is an example of training a LinearRegression using river.
from river import datasets
from river import linear_model
from river import metrics
from river import preprocessing
dataset = datasets.TrumpApproval()
model = (
    preprocessing.StandardScaler() |
    linear_model.LinearRegression(intercept_lr=.1)
)
metric = metrics.MAE()
for x, y in dataset:
    y_pred = model.predict_one(x)
    # Update the running metric with the prediction and ground truth value
    metric.update(y, y_pred)
    # Train the model with the new sample
    model.learn_one(x, y)
It is not clear from your question which steps of the machine learning workflow are slow for you. As also noted in the manual for riverml and in this post on sklearn, there is an option to do a partial fit. You will be restricted in terms of the models you can use for this incremental learning.
So, using your example, let's say we use a stochastic gradient descent classifier:
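import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_classification
X, y = make_classification(100000)
# loss='log_loss' is the current name; scikit-learn before 1.1 spelled it 'log'
clf = SGDClassifier(loss='log_loss')
all_classes = np.unique(y)
# split the row indices into 100 equal batches and feed them one at a time
for ix in np.split(np.arange(0, X.shape[0]), 100):
    clf.partial_fit(X[ix, :], y[ix], classes=all_classes)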
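Since the question is about data that does not fit in memory, here is a rough sketch of the same partial_fit loop reading its batches from disk. The temporary .npy files, the batch size of 1000 and the use of np.load with mmap_mode are my own assumptions for the illustration; your data may well live in another format:

```python
import os
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Write a stand-in dataset to disk so the loop below can read it lazily.
X, y = make_classification(n_samples=10000, n_features=20, random_state=0)
tmp_dir = tempfile.mkdtemp()
x_path = os.path.join(tmp_dir, "X.npy")
y_path = os.path.join(tmp_dir, "y.npy")
np.save(x_path, X)
np.save(y_path, y)

# mmap_mode="r" maps the files instead of loading them fully into memory.
X_disk = np.load(x_path, mmap_mode="r")
y_disk = np.load(y_path, mmap_mode="r")

clf = SGDClassifier(random_state=0)
all_classes = np.unique(y)  # assumed known up front; needed on the first call
batch_size = 1000
for start in range(0, X_disk.shape[0], batch_size):
    stop = start + batch_size
    # Only this slice is copied into memory.
    X_batch = np.asarray(X_disk[start:stop])
    y_batch = np.asarray(y_disk[start:stop])
    clf.partial_fit(X_batch, y_batch, classes=all_classes)

# Scored on the in-memory copy only because this is a toy example.
train_accuracy = clf.score(X, y)
print("accuracy after one pass:", train_accuracy)
```

In a real out-of-core setting the full X and y would never be held in memory; they are only created here to have something to write to disk.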
After reading section 6, "Strategies to scale computationally: bigger data", of the official manual mentioned by @StupidWolf in this post, I realize that there is more to this question than meets the eye.
The real difficulty lies in the design of many models.
Take Random Forest as an example. One of the most important techniques it uses to improve on the simpler Decision Tree is bagging, which means that the algorithm has to draw random samples from the entire dataset to build the many weak learners that form the basis of the Random Forest. Feeding the model one sample after another therefore won't work with this design.
It would still be possible for scikit-learn to define an interface for end users to implement: scikit-learn would request a random sample through this interface, and the end user's implementation would decide how to return the needed data by scanning the dataset on the hard drive. But this is far more complicated than I initially thought, and the performance gain may not be very significant, given that an IO-heavy "full table scan" (in database terms) would be needed frequently.
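To illustrate why bagging needs access to the whole dataset, here is a small sketch of how one bootstrap sample is drawn. It is plain NumPy and not scikit-learn's internal code; the dataset size of 1000 is an arbitrary assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 1000  # assumed size of the full training set

# One bootstrap sample: n_samples row indices drawn with replacement.
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)

# The indices are scattered over the whole dataset, so the rows cannot be
# served by reading the file once from top to bottom in small batches.
unique_rows = np.unique(bootstrap_idx)
fraction_used = unique_rows.size / n_samples
print("fraction of distinct rows in this sample:", fraction_used)
```

Each tree gets its own such sample (on average about 63% of the distinct rows), which is why every tree would trigger its own pass over the data on disk.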
I have strain and temperature data, and I have read this article:
https://www.idtools.com.au/principal-component-regression-python-2/
I'm trying to build a model that predicts the strain from the temperature.
The cross-validation score I get is negative.
The data set is here:
http://www.mediafire.com/file/r7dg7i9dacvpl2j/curve_fitting_ahmed.xlsx/file
My question is: do the cross-validation results make sense?
My code is the following.
The input is a pandas DataFrame.
# Import the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import linear_model
from sklearn.decomposition import PCA
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_predict
from sklearn.preprocessing import StandardScaler

def pca_analysis(temperature, strain):
    # Import Data
    print("process data")
    T1 = temperature['T1'].tolist()
    W_A1 = strain[0]
    N = len(T1)
    xData = np.reshape(T1, (N, 1))
    yData = np.reshape(W_A1, (N, 1))
    # Define the PCA object
    pca = PCA()
    Xstd = StandardScaler().fit_transform(xData)
    # Run PCA producing the reduced variable Xred and select the first pc components
    Xreg = pca.fit_transform(Xstd)[:, :2]
    ''' Step 2: regression on selected principal components'''
    # Create linear regression object
    regr = linear_model.LinearRegression()
    # Fit
    regr.fit(Xreg, W_A1)
    # Calibration
    y_c = regr.predict(Xreg)
    # Cross-validation
    y_cv = cross_val_predict(regr, Xreg, W_A1, cv=10)
    # Calculate scores for calibration and cross-validation
    score_c = r2_score(W_A1, y_c)
    score_cv = r2_score(W_A1, y_cv)
    # Calculate mean square error for calibration and cross validation
    mse_c = mean_squared_error(W_A1, y_c)
    mse_cv = mean_squared_error(W_A1, y_cv)
    print(mse_c)
    print(mse_cv)
    print(score_c)
    print(score_cv)
    # Regression plot
    z = np.polyfit(W_A1, y_c, 1)
    with plt.style.context(('ggplot')):
        fig, ax = plt.subplots(figsize=(9, 5))
        ax.scatter(W_A1, y_c, c='red', s=0.4, edgecolors='k')
        ax.plot(W_A1, z[1] + z[0] * yData, c='blue', linewidth=1)
        ax.plot(W_A1, W_A1, color='green', linewidth=1)
        plt.title('$R^{2}$ (CV): ' + str(score_cv))
        # Raw strings so that \c is not treated as an escape sequence
        plt.xlabel(r'Measured $^{\circ}$Strain')
        plt.ylabel(r'Predicted $^{\circ}$Strain')
        plt.show()
Here is the result of the PCR:
[Figure: regression plot produced by the code above]
How can I improve the prediction using this data?
According to the scikit-learn documentation, the value returned by r2_score can be negative, because a model can be arbitrarily worse than a constant model that always predicts the mean of y (which scores 0). Obviously this is not what one wants from ML; you expect results that are better than that baseline.
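A tiny worked example of how r2_score goes negative (the five numbers are made up purely for illustration):

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Always predicting the mean of y_true gives R^2 = 0.
r2_mean = r2_score(y_true, np.full_like(y_true, y_true.mean()))

# Predictions that are further from the truth than the mean give R^2 < 0.
# Here the residual sum of squares is 40 and the total sum of squares is 10,
# so R^2 = 1 - 40/10 = -3.
r2_bad = r2_score(y_true, y_true[::-1])

print(r2_mean, r2_bad)
```

So a negative cross-validated score simply says that, on held-out data, the model does worse than predicting the mean strain.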
The first thing I would note is that your data seems like it may be quite nonlinear, in which case PCA struggles to improve model performance.
One potential substitute for PCA that can account for nonlinearities in the data is to preprocess it with an autoencoder (good article on these here). Autoencoders can capture nonlinearities if you use nonlinear activation functions on some of their hidden layers, which may help your model's performance. There are many articles around the web that explain this; let me know if you want some resources, should you choose to pursue this course.
The next thing that I would note is that r2_score is really not the best way to measure error, and that using mean-squared error is much more popular, especially for linear regression. So, if you want to keep your model as simple as this, I would simply ignore the r2_score and move on from there. However, that being said, linear regression is not equipped to solve very complex problems due to its simplicity, and judging by the picture you provided, it's pretty clear to me that linear regression is very rough when applied to this dataset.
I would be interested to know the difference in mean squared error between the PCA and non-PCA applied data. Here, the PCA should have less error than the normal, non-PCA applied data. If it does not, then either your data is horribly nonlinear (maybe?) or there is an error in your code (I looked over it and nothing is immediately obviously wrong with it). For linear regression, mean squared error is almost unanimously the error function of choice, and it is remarkably effective. Hope this answers your question; leave a comment or question about my answer if you have one and I will try to clarify as best I can.
Also, while answering your question, I came across this other question that I believe explains your problem pretty well (and uses some math, so be prepared). Most notably, there are situations where R^2 is appropriate to use for your model, but given your results, I would say that R^2 would probably be a poor choice of error function for this data.
Update: Given the values that you get for the mean squared error, my first guess would be that PCA is either 1) not working because of the nature of the data, or 2) implemented incorrectly. While I am not an expert with the libraries you are using, I would make sure that you transform all of the data in the same way, i.e. make sure that the PCA-transformed vectors are being compared with transformed vectors.
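One way to make sure every row goes through the same transformation is to wrap the steps in a scikit-learn pipeline. This is only a sketch under assumptions: the synthetic X and y stand in for the temperature and strain data, and three input features with two retained components are arbitrary choices:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the temperature/strain data (an assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Scaling, PCA and regression are refit inside every CV fold, so held-out
# rows are always transformed with parameters learned from the training rows.
pcr = make_pipeline(StandardScaler(), PCA(n_components=2), LinearRegression())
y_cv = cross_val_predict(pcr, X, y, cv=10)
mse_cv = mean_squared_error(y, y_cv)
print("cross-validated MSE:", mse_cv)
```

Passing the pipeline (rather than pre-transformed data) to cross_val_predict also avoids fitting the scaler and PCA on rows that are later used for validation.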
For moving on from linear regression, I would look into building a simple neural network or an SVR (this might be a little trickier). Both of these methods are known to work well for complex data and are very adaptable. There are tons of resources online for both, and I think giving specifics on the implementation of either method might be out of the scope of this question (you might have to ask a separate one about this).
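Just to give a flavour of the SVR route, here is a minimal sketch on made-up nonlinear data. The sine-shaped data and the values C=10.0 and epsilon=0.05 are assumptions picked for the illustration, not tuned recommendations for your dataset:

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Made-up nonlinear relationship: y = sin(x) plus a little noise.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

# SVR is sensitive to feature scale, so standardize inside the pipeline.
svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.05))
y_cv = cross_val_predict(svr, X, y, cv=5)
score_cv = r2_score(y, y_cv)
print("cross-validated R^2:", score_cv)
```

On real data the kernel and its parameters would normally be chosen with a grid search rather than set by hand.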
I'm trying to figure out how to configure a neural network using Neupy. The problem is that I can't seem to find many options for a GRNN, only the sigma value as described here:
There is a parameter, y_i, that I want to be able to adjust, but there doesn't seem to be a way to do it in the package. I'm reading through the code, but I'm not a developer, so I have trouble following all the steps; maybe a more experienced set of eyes can find a way to tweak that parameter.
Thanks
From the link that you've provided it looks like y_i is the target variable. In your case it's your target training variable. In the neupy code it's used during the prediction. https://github.com/itdxer/neupy/blob/master/neupy/algorithms/rbfn/grnn.py#L140
GRNN uses lazy learning, which means that it doesn't train; it simply reuses all of your training data for each prediction. The self.target_train variable is just a copy of the targets you pass in during the training phase. You can update this value before making a prediction:
from neupy import algorithms
grnn = algorithms.GRNN(std=0.1)
grnn.train(x_train, y_train)
grnn.target_train = modify_grnn_algorithm(grnn.target_train)
predicted = grnn.predict(x_test)
Or you can use the GRNN code for prediction instead of the default predict function:
import numpy as np
from neupy import algorithms
from neupy.algorithms.rbfn.utils import pdf_between_data
grnn = algorithms.GRNN(std=0.1)
grnn.train(x_train, y_train)
# In this part of the code you can make any modifications you want
ratios = pdf_between_data(grnn.input_train, x_test, grnn.std)
predicted = (np.dot(grnn.target_train.T, ratios) / ratios.sum(axis=0)).T
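To show where y_i enters the prediction, here is a plain NumPy sketch of the textbook GRNN formula (a Gaussian-weighted average of the training targets). It is not neupy's code, and the function name grnn_predict and the toy data are my own assumptions:

```python
import numpy as np

def grnn_predict(x_train, y_train, x_test, std):
    """Kernel-weighted average of y_train, one output row per test row."""
    # Squared Euclidean distance between every test row and every training row.
    diff = x_test[:, None, :] - x_train[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=2)
    # Gaussian weights; the normalising constant cancels in the ratio below.
    weights = np.exp(-sq_dist / (2.0 * std ** 2))
    # y_train plays the role of y_i: it is only ever used in this average.
    return weights @ y_train / weights.sum(axis=1, keepdims=True)

x_train = np.array([[0.0], [1.0], [2.0]])
y_train = np.array([[0.0], [10.0], [20.0]])
x_test = np.array([[1.0]])
predicted = grnn_predict(x_train, y_train, x_test, std=0.5)
print(predicted)  # the two outer points are weighted equally, so this is 10
```

Changing y_train (or std) before calling the function is the equivalent of editing the stored targets on the neupy object. Note that test points very far from all training points can make every weight underflow to zero, which this sketch does not guard against.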
Is there a way to supply the coefficients to an SVC clf in my script and then apply clf.score() or clf.predict() for further tests?
Currently I am using joblib.dump(clf,'file.plk') to save all the information of a trained clf, but this involves disk writing/reading. It would help me if I could just define a clf from arrays representing the support vectors (clf.support_vectors_), weights (clf.coef_/clf.dual_coef_), and bias (clf.intercept_) respectively.
This line calls the prediction function from libsvm. It looks like this (but please take a look at the whole function _dense_predict):
libsvm.predict(
    X, self.support_, self.support_vectors_, self.n_support_,
    self.dual_coef_, self._intercept_,
    self.probA_, self.probB_, svm_type=svm_type, kernel=kernel,
    degree=self.degree, coef0=self.coef0, gamma=self._gamma,
    cache_size=self.cache_size)
You can use this line and give it all the relevant information directly, and you will obtain a raw prediction. To do this, you must import libsvm with from sklearn.svm import libsvm (in recent scikit-learn releases the module is private and lives at sklearn.svm._libsvm). If your initial fitted classifier is called svc, then you can obtain all the relevant information from it by replacing all the self keywords with svc and keeping the values. If svc._impl gives you "c_svc", then you set svm_type=0.
Note that at the beginning of the _dense_predict function you have X = self._compute_kernel(X). If your data is X, then you need to transform it by doing K = svc._compute_kernel(X), and call the libsvm.predict function with K as the first argument.
Scoring is independent from all this. Take a look at sklearn.metrics, where you will find e.g. the accuracy_score, which is the default score in SVM.
This is of course a somewhat suboptimal way of doing things, but in this specific case, if it is impossible (I didn't check very hard) to set the coefficients, then going into the code, seeing what it does and extracting the relevant part is surely an option.
Check out this blog post on memory usage of sklearn models using succinct tries to see if it is applicable.
If the other location does not have access to the sklearn packages, you would need to create your own score and predict functions. clf.score() and clf.predict() require clf to be an sklearn object.
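As a rough sketch of such a hand-written predict function for a two-class SVC with an RBF kernel: the helper name predict_from_arrays, the synthetic data and the fixed gamma=0.5 are assumptions for the example. scikit-learn is used here only to train the model, to compute the kernel and to check the result; the kernel itself could be replaced by plain NumPy.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
gamma = 0.5  # fixed explicitly so it can be stored alongside the arrays
clf = SVC(kernel="rbf", gamma=gamma).fit(X, y)

# The arrays that would be kept instead of the pickled object.
support_vectors = clf.support_vectors_.copy()
dual_coef = clf.dual_coef_.copy()   # shape (1, n_support) for two classes
intercept = clf.intercept_.copy()   # shape (1,)
classes = clf.classes_.copy()

def predict_from_arrays(X_new):
    # Kernel between new rows and support vectors: (n_new, n_support).
    K = rbf_kernel(X_new, support_vectors, gamma=gamma)
    decision = K @ dual_coef[0] + intercept[0]
    # Positive decision values map to classes[1], the rest to classes[0].
    return classes[(decision > 0).astype(int)], decision

manual_pred, manual_decision = predict_from_arrays(X)
print("matches clf.predict:", np.array_equal(manual_pred, clf.predict(X)))
```

This only covers the binary case; with more than two classes SVC combines several one-vs-one decision functions, so the reconstruction is more involved. If the only goal is to avoid touching the disk, pickle.dumps(clf) keeps the whole fitted object in memory as bytes.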
I've trained a Random Forest model (a regressor in this case) using scikit-learn (Python), and I would like to plot the error rate on a validation set as a function of the number of estimators used. In other words, is there a way to predict using only a portion of the estimators in a RandomForestRegressor?
Using predict(X) gives you predictions based on the mean of every single tree's result. Is there a way to limit which trees are used? Or, failing that, to get the individual output of each tree in the forest?
Thanks to cohoz I've figured out how to do it.
I've written a couple of functions, which turned out to be handy for plotting the learning curve of the random forest regressor on the test set.
## Error metric
import numpy as np
def rmse(train, test):
    return np.sqrt(np.mean(pow(test - train, 2)))
## Print test set error
## Input the RandomForestRegressor, test set features and test set known values
def rfErrCurve(rf_model, test_X, test_y):
    p = []
    for i, tree in enumerate(rf_model.estimators_):
        p.insert(i, tree.predict(test_X))
        # error of the forest made of the first i+1 trees
        print(rmse(np.mean(p, axis=0), test_y))
Once trained, you can access these via the "estimators_" attribute of the random forest object.
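For plotting, it can be handy to get the whole curve back as an array instead of printing it. Here is a sketch of that idea using a running mean over the trees; the synthetic regression data and the choice of 50 trees are assumptions for the example:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)

# One row of predictions per tree: shape (n_trees, n_test_samples).
per_tree = np.array([tree.predict(X_test) for tree in rf.estimators_])

# Running mean over the first 1, 2, ..., n_trees trees.
n_trees = np.arange(1, len(rf.estimators_) + 1)[:, None]
running_mean = np.cumsum(per_tree, axis=0) / n_trees

# RMSE on the test set as a function of the number of trees used.
rmse_curve = np.sqrt(np.mean((running_mean - y_test) ** 2, axis=1))
print(rmse_curve[:5])
```

The last row of running_mean is the same as rf.predict(X_test), and rmse_curve can be passed straight to a plotting call with the tree count on the x axis.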