I'm trying to figure out how to configure a neural network using Neupy. The problem is that I can't seem to find many options for a GRNN, only the sigma value as described here:
There is a parameter, y_i, that I want to be able to adjust, but there doesn't seem to be a way to do it in the package. I'm parsing through the code, but I'm not a developer, so I have trouble following all the steps; maybe a more experienced set of eyes can find a way to tweak that parameter.
Thanks
From the link that you've provided, it looks like y_i is the target variable; in your case, your training target. In the neupy code it's used during prediction. https://github.com/itdxer/neupy/blob/master/neupy/algorithms/rbfn/grnn.py#L140
GRNN uses lazy learning, which means that it doesn't really train; it just re-uses all of your training data for each prediction. The self.target_train variable is just a copy of the targets you passed during the training phase. You can update this value before making a prediction:
from neupy import algorithms

grnn = algorithms.GRNN(std=0.1)
grnn.train(x_train, y_train)

# Overwrite the stored training targets before predicting
# (modify_grnn_algorithm stands for whatever transformation you want to apply)
grnn.target_train = modify_grnn_algorithm(grnn.target_train)
predicted = grnn.predict(x_test)
Or you can reuse the GRNN prediction code directly, instead of the default predict function:
import numpy as np
from neupy import algorithms
from neupy.algorithms.rbfn.utils import pdf_between_data

grnn = algorithms.GRNN(std=0.1)
grnn.train(x_train, y_train)

# In this part of the code you can do any modifications you want
ratios = pdf_between_data(grnn.input_train, x_test, grnn.std)
predicted = (np.dot(grnn.target_train.T, ratios) / ratios.sum(axis=0)).T
Related
I tried using pycaret for a machine learning project and got very high accuracies. When I tried to verify these using my sklearn code I found that I could not get the same numbers. Here is an example where I reproduce this issue on the public poker dataset from pycaret:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pycaret.classification import *
from pycaret.datasets import get_data
data = get_data('poker')
grid = setup(data=data, target='CLASS', fold_shuffle=True, session_id=2)
dt = create_model('dt')
This gives an accuracy of about 57% using 10-fold cross-validation. When I try to reproduce this number using sklearn on the same dataset with the same model, I get only 49%. Does anyone understand where this difference comes from?
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score
X = data.drop('CLASS', axis = 1)
y = data['CLASS']
y_pred_cv = cross_val_predict(dt, X, y, cv=10)
accuracy_score(y, y_pred_cv)
0.4911698233964679
I think the difference could be due to how your CV folds are being randomized. Did you set the same seed (2) in sklearn? Is the shuffle parameter of KFold set the same?
I had some trouble validating the results from PyCaret myself. I see two options you can try to validate the results:
Is your data correlated in some way? You are using sklearn.model_selection.cross_val_predict with cv=10, which means that (stratified) k-fold cross-validation is used to generate your folds. In either case, these splitters are instantiated with shuffle=False. If your data is correlated, this may explain the higher accuracy that you observe. You want to set shuffle=True (see the sketch after this list).
PyCaret by default makes a 70%/30% train/test split. If you use its create_model method, then the cross-validation is done on the train set only. In your validation you use 100% of the data. This might alter the results a bit, but I doubt it explains the gap that you observe.
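If you want to test the shuffling point, a minimal sketch (reusing the dt model and data from above, and assuming that seeding the splitter with 2 roughly mirrors PyCaret's session_id; it does not guarantee identical folds) could look like this:
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.metrics import accuracy_score

# Shuffle the folds and seed the splitter for reproducibility
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=2)
y_pred_cv = cross_val_predict(dt, X, y, cv=skf)
print(accuracy_score(y, y_pred_cv))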
The parameters could be the same, but did you reproduce all the feature engineering done inside setup (feature selection, collinearity removal, normalisation, etc.)?
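One way to rule the feature engineering out is to run your sklearn cross-validation on PyCaret's own transformed training data. This is only a sketch: it assumes a PyCaret 2.x style API where get_config exposes the internal train split under these key names, which may differ in other versions.
from pycaret.classification import get_config
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score

# Key names assumed from PyCaret 2.x; check your version's documentation
X_train_pc = get_config('X_train')
y_train_pc = get_config('y_train')

y_pred_cv = cross_val_predict(dt, X_train_pc, y_train_pc, cv=10)
print(accuracy_score(y_train_pc, y_pred_cv))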
Usually people use scikit-learn to train a model this way:
from sklearn.ensemble import GradientBoostingClassifier as gbc
clf = gbc()
clf.fit(X_train, y_train)
predicted = clf.predict(X_test)
It works fine as long as the user's memory is large enough to accommodate the entire dataset. My dilemma is exactly this: the dataset is too big for my memory. My current workaround is to enlarge the virtual memory of my machine, which has already made the system extremely slow. So I started to wonder whether it is possible to feed the fit() method with samples in batches like this (and the answer is no, please keep reading and stop reminding me that the answer is no):
clf = gbc()
for i in range(X_train.shape[0]):
    clf.fit(X_train[i], y_train[i])
so that I can read the training set from the hard drive only when needed. I read sklearn's manual and it seems to me that it does not support this:
Calling fit() more than once will overwrite what was learned by any previous fit()
So, is this possible?
This does not work in scikit-learn, as explained in the comment section as well as in the documentation. However, you can use river (a Python package for online/streaming machine learning). This package should be well suited to your problem.
Below is an example of training a LinearRegression using river.
from river import datasets
from river import linear_model
from river import metrics
from river import preprocessing
dataset = datasets.TrumpApproval()
model = (
    preprocessing.StandardScaler() |
    linear_model.LinearRegression(intercept_lr=.1)
)

metric = metrics.MAE()

for x, y in dataset:
    y_pred = model.predict_one(x)

    # Update the running metric with the prediction and ground truth value
    metric.update(y, y_pred)

    # Train the model with the new sample
    model.learn_one(x, y)
It is not clear from your question which steps of the machine learning process are slow for you. As also noted in the river manual and in this sklearn post, there is an option to do a partial fit. You will be restricted in which models you can use for this kind of incremental learning.
So, using your example, let's say we use a stochastic gradient descent classifier:
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_classification

X, y = make_classification(100000)
clf = SGDClassifier(loss='log')  # 'log_loss' in newer scikit-learn versions
all_classes = list(set(y))

# Feed the data in 100 batches; partial_fit needs the full class list up front
for ix in np.split(np.arange(0, X.shape[0]), 100):
    clf.partial_fit(X[ix, :], y[ix], classes=all_classes)
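If the real bottleneck is that the training set does not fit in RAM, the same idea can be combined with reading the data from disk in chunks. This is only a rough sketch, assuming a hypothetical train.csv file with the label in a column named target:
import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss='log')
all_classes = np.array([0, 1])  # assumed label set; partial_fit needs it up front

# Only one chunk is held in memory at a time
for chunk in pd.read_csv('train.csv', chunksize=10000):
    y_chunk = chunk['target'].values
    X_chunk = chunk.drop(columns='target').values
    clf.partial_fit(X_chunk, y_chunk, classes=all_classes)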
After reading section 6 ("Strategies to scale computationally: bigger data") of the official manual mentioned by @StupidWolf in this post, I realize there is more to this question than meets the eye.
The real difficulty lies in the design of many of the models themselves.
Take Random Forest as an example. One of the most important techniques used to improve its performance over a single Decision Tree is bagging, which means the algorithm has to draw random samples from the entire dataset to construct several weak learners as the basis of the forest. Feeding the model one sample after another simply does not fit this design.
Scikit-learn could, in principle, define an interface for end-users to implement, so that it could request a random sample through that interface and the user's implementation would decide how to return the needed data by scanning the dataset on the hard drive. But that is far more complicated than I initially thought, and the performance gain may not be significant given how often an IO-heavy "full table scan" (in database terms) would be needed.
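To make that concrete, here is a minimal numpy sketch of the bootstrap sampling behind bagging (toy data, not scikit-learn's actual implementation); each weak learner draws its sample from the full dataset, which is why the whole thing has to be accessible at fit time:
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.rand(1000, 5)   # toy data standing in for the full training set
y = X.sum(axis=1)

trees = []
for _ in range(10):
    # each bootstrap sample is drawn (with replacement) from all of the rows
    idx = rng.randint(0, X.shape[0], size=X.shape[0])
    trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

# the ensemble prediction averages the individual trees
pred = np.mean([t.predict(X[:5]) for t in trees], axis=0)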
I'm working with Tensorflow, but I'm pretty new to Python and machine learning. If I have a tensor of an image from my input pipeline, what would be the best way to train on it? Like, at the basic level, how would I handle passing data through? I have a structure I would like to use (I know I can get certain data from certain things like tensors), but I'm just not sure how to do so.
I'm very new to this so all help would be greatly appreciated.
def model(image_tensor):
    tf.summary.image('input', image_tensor)
    return predictions

def loss(predictions, labels):
    return some_loss

def train(some_loss):
    return train_op
Tensorflow may be a bit complicated for someone new to machine learning and Python. My advice is to go through the excellent notebook tutorials on the tensorflow site and start to understand the abstraction.
However, before that, I would use Python with numpy (and sometimes scipy) to implement basic machine learning methods like stochastic gradient descent, just to ensure that you understand how the algorithms work. Then implement a simple logistic regression.
So why do I ask you to do all that? Well, because once you get a good handle on how machine learning algorithms work, and how tedious it can be to derive the gradients by hand, you will understand why the tensorflow abstraction is useful.
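For example, a minimal numpy sketch of logistic regression trained with plain SGD on toy data (no tensorflow involved) might look like this:
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(200, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(float)   # toy, roughly separable labels

w = np.zeros(2)
b = 0.0
lr = 0.1

for epoch in range(20):
    for i in rng.permutation(len(X)):
        p = 1.0 / (1.0 + np.exp(-(X[i] @ w + b)))   # sigmoid
        grad = p - y[i]                              # gradient of the log loss w.r.t. the logit
        w -= lr * grad * X[i]
        b -= lr * grad
Writing even this tiny loop by hand makes it obvious why automatic differentiation is worth having.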
I'm going to provide you with some simple examples dealing with MNIST.
from sklearn.datasets import load_digits
import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np

# Load only the digits 0 and 1
X, y = load_digits(n_class=2, return_X_y=True)

print("y [shape: {}]: {}".format(y.shape, y[:10]))
print("X [shape: {}]".format(X.shape))
What I've essentially done above is load two digit classes (0 and 1) and display the target vector y and the feature matrix X.
If you want to see how the images look, you can call plt.imshow(X[0].reshape([8,8])).
The next step is to start defining our placeholders and variables:
input_x = tf.placeholder(tf.float32,shape=[None,X.shape[1]], name = "input_x")
input_y = tf.placeholder(tf.float32,shape=[None,],name = "labels")
weights = tf.Variable(initial_value = tf.zeros(shape=[X.shape[1],1]), name="weights")
b = tf.Variable(initial_value=0.0, name = "bias")
What we have done here is define two placeholders in tensorflow and tell them what shape of input to expect. I also gave each placeholder a name for debugging purposes.
prediction_y = tf.squeeze(tf.nn.sigmoid(tf.add(tf.matmul(input_x, weights), tf.cast(b, tf.float32))))
loss = tf.losses.log_loss(input_y, prediction_y)
optimizer = tf.train.AdamOptimizer(0.001).minimize(loss)
There you go, that's a logistic regression in tensorflow. What the last block does is apply the sigmoid activation to our inputs, define the loss function, and then define an optimizer that minimizes that loss.
The final step is to run it.
from sklearn.metrics import roc_auc_score

# X_train / y_train are assumed to come from a train/test split of X and y
s = tf.InteractiveSession()
s.run(tf.global_variables_initializer())

for i in range(10):
    s.run(optimizer, {input_x: X_train, input_y: y_train})
    loss_i = s.run(loss, {input_x: X_train, input_y: y_train})
    print("loss at iteration {}: {}".format(i, loss_i))

print("train AUC:", roc_auc_score(y_train, s.run(prediction_y, {input_x: X_train})))
That's essentially how you run your data through tensorflow. This code may have typos; I don't have Python on this machine, so I'm writing from memory. However, the basic idea is there. Hope this helps.
Edit: You also asked about the best way to train on image data. My answer is that there isn't a single "best". Building a CNN is a typical approach that you may want to experiment with, assuming you have a large number of labelled images. Before that, people also used support vector machines fairly successfully for classifying images.
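If you do go the CNN route, a minimal Keras-style sketch (assuming 28x28 grayscale images and 10 classes, which is not your actual pipeline) would be something like:
import tensorflow as tf

# input shape and class count are assumptions; adapt them to your own image tensors
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# model.fit(images, labels, epochs=5)  # images/labels come from your input pipeline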
I am relatively new to logistic regression using scikit-learn in Python. After reading some topics and viewing some demos, I decided to dive in myself.
So, basically, I am trying to predict the conversion rate of customers based on some features. The outcome is either Active (1) or Not active (0). I tried KNN and logistic regression. With KNN I get an average accuracy of 0.893 and with logistic regression 0.994. The latter seems so high; is that even realistic/possible?
Anyway: suppose that my model is indeed very accurate. I would now like to import a new dataset with the same feature columns and predict their conversions (they end this month). In the case above I used cross_val_score to get the accuracy scores.
Do I now need to import the new set and somehow fit it to this model? (Not training it again; I just want to use it.)
Can someone please inform me how I can proceed? If additional info is needed, please comment on that.
Thanks in advance!
For the statistics question: of course it can happen; either your data has very little noise, or it is the scenario Clock Slave mentioned in the comments.
For importing the classifier, you can pickle it (save it as a binary with the pickle module), and then just load it whenever you need it and use the clf.predict() method on the new data:
import pickle

# Do the classification and name the fitted object clf
with open('clf.pickle', 'wb') as file:
    pickle.dump(clf, file, pickle.HIGHEST_PROTOCOL)
And then later you can load it
import pickle

with open('clf.pickle', 'rb') as file:
    clf = pickle.load(file)

# Now predict on the new dataframe df
pred = clf.predict(df.values)
Besides pickle, joblib can be used as well.
from sklearn.linear_model import LogisticRegression
from sklearn.externals import joblib  # in newer scikit-learn versions, use `import joblib` directly

# assume X and Y are already defined
model = LogisticRegression()
model.fit(X, Y)

# save the model to disk
filename = 'finalized_model.sav'
joblib.dump(model, filename)

# load the model from disk
loaded_model = joblib.load(filename)
result = loaded_model.score(X_test, Y_test)
I've trained a Random Forest (regressor, in this case) model using scikit-learn (Python), and I would like to plot the error rate on a validation set based on the number of estimators used. In other words, is there a way to predict using only a portion of the estimators in your RandomForestRegressor?
Using predict(X) will give you predictions based on the mean of every single tree's result. Is there a way to limit how many trees are used? Or, alternatively, to get the individual output of each single tree in the forest?
Thanks to cohoz I've figured out how to do it.
I've written a couple of functions, which turned out to be handy while plotting the learning curve of the random forest regressor on the test set.
## Error metric
import numpy as np

def rmse(train, test):
    return np.sqrt(np.mean(pow(test - train, 2)))

## Print test set error
## Input: the RandomForestRegressor, test set features and test set known values
def rfErrCurve(rf_model, test_X, test_y):
    p = []
    for i, tree in enumerate(rf_model.estimators_):
        p.insert(i, tree.predict(test_X))
        print(rmse(np.mean(p, axis=0), test_y))
Once trained, you can access these via the "estimators_" attribute of the random forest object.
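To answer the "portion of the estimators" part directly, here is a small sketch that averages only the first k trees via that same estimators_ attribute (rf_model, test_X and test_y are assumed to be your fitted forest and validation data):
import numpy as np

def predict_with_k_trees(rf_model, X, k):
    # average the outputs of only the first k fitted trees
    preds = np.array([tree.predict(X) for tree in rf_model.estimators_[:k]])
    return preds.mean(axis=0)

# error on the validation set as the number of trees grows
errors = [rmse(predict_with_k_trees(rf_model, test_X, k), test_y)
          for k in range(1, len(rf_model.estimators_) + 1)]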