I'm trying to apply a baseline model to my data set, but the data set is imbalanced: only 11% of the records belong to the positive class. When I split the data without any resampling, the recall for positive records is very low. I want to balance the training data (0.5 negative, 0.5 positive) without balancing the test data. Does anyone know how to do that?
#splitting train and test data
from sklearn.model_selection import train_test_split

train, test = train_test_split(coupon, test_size=0.3, random_state=100)

##separating dependent and independent variables
cols = [i for i in coupon.columns if i not in target_col]
train_X = train[cols]
train_Y = train[target_col]
test_X = test[cols]
test_Y = test[target_col]
import pandas as pd
from sklearn.metrics import classification_report, accuracy_score

#Function attributes
#algorithm      - algorithm used
#training_x     - predictor variables dataframe (training)
#testing_x      - predictor variables dataframe (testing)
#training_y     - target variable (training)
#testing_y      - target variable (testing)
#cf             - "coefficients" or "features" (coefficients for logistic
#                 regression, feature importances for tree-based models)
#threshold_plot - if True, returns a threshold plot for the model
def coupon_use_prediction(algorithm, training_x, testing_x,
                          training_y, testing_y, cols, cf, threshold_plot):
    #fit the model and predict on the test set
    algorithm.fit(training_x, training_y)
    predictions = algorithm.predict(testing_x)
    probabilities = algorithm.predict_proba(testing_x)

    #coefficients (logistic regression) or feature importances (tree-based models)
    if cf == "coefficients":
        coefficients = pd.DataFrame(algorithm.coef_.ravel())
    elif cf == "features":
        coefficients = pd.DataFrame(algorithm.feature_importances_)

    column_df = pd.DataFrame(cols)
    coef_sumry = pd.merge(coefficients, column_df, left_index=True,
                          right_index=True, how="left")
    coef_sumry.columns = ["coefficients", "features"]
    coef_sumry = coef_sumry.sort_values(by="coefficients", ascending=False)

    print(algorithm)
    print("\n Classification report : \n", classification_report(testing_y, predictions))
    print("Accuracy Score : ", accuracy_score(testing_y, predictions))
You have two ways of balancing data: up-sampling or down-sampling.
Up-sampling: duplicate the under-represented (here, positive) records.
Down-sampling: sample from the over-represented (here, negative) records.
Up-sampling is pretty easy. For down-sampling you can use sklearn.utils.resample and provide the number of samples you want to get.
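As a minimal sketch (assuming target_col is the name of the target column and the positive class is labelled 1, as the question's code suggests), down-sampling the training set could look like this:

from sklearn.utils import resample
import pandas as pd

#split the training data by class
majority = train[train[target_col] == 0]
minority = train[train[target_col] == 1]

#down-sample the majority class to the size of the minority class
majority_down = resample(majority, replace=False,
                         n_samples=len(minority), random_state=100)

#50/50 balanced training set; the test set stays untouched
train_balanced = pd.concat([majority_down, minority])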
Please note that, as @paritosh-singh mentioned, changing the distribution may not be the only solution. There are machine learning algorithms that can:
- support imbalanced data directly
- use a built-in weighting option to take the class distribution into account (see the sketch below)
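For instance, a minimal sketch, an assumption rather than part of the original answer: many scikit-learn estimators expose a class_weight option, so no resampling of the training data is needed.

from sklearn.linear_model import LogisticRegression

#class_weight='balanced' reweights classes inversely to their frequency
model = LogisticRegression(class_weight='balanced', max_iter=1000)
model.fit(train_X, train_Y)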
So my understanding is that SPE is the reconstruction error when using PCA (principal component analysis). Therefore, when I obtain my loading matrix from the training data and use it to calculate SPE for both the training data and the validation data, the SPE for the validation data should in general be bigger than or close to the SPE for the training data. However, in my results the validation SPE is sometimes smaller than the training SPE.
Below is my code in Python. The PCA function is from sklearn. train_x and valid_x are datasets standardized by the mean and stdev of train_x. P50_cols is the total number of columns.
Is there anything wrong with my code?
import numpy as np
from sklearn.decomposition import PCA

#fit a full PCA to find how many components explain 90% of the variance
pca = PCA(n_components=P50_cols)
pca.fit(train_x)
sumratio = 0
k = 0
for k, r in enumerate(pca.explained_variance_ratio_):
    sumratio += r
    if sumratio > 0.9:
        break
num_component = k + 1

#refit with the chosen number of components
pca = PCA(n_components=num_component)
pca.fit(train_x)

#reconstruct both sets and compute the squared prediction error (SPE) per sample
train_x_restore = pca.inverse_transform(pca.transform(train_x))
valid_x_restore = pca.inverse_transform(pca.transform(valid_x))
spe_train = np.sum((train_x_restore - train_x.values)**2, 1)
spe_valid = np.sum((valid_x_restore - valid_x.values)**2, 1)
I have created a KNN model in Python (scikit-learn) using three variables (Age, Distance, Travel Allowance) as predictors, with the aim of using them to predict an outcome for the target variable (Method of Travel).
When constructing the model, I had to normalize the data for the three predictor variables. This increased the accuracy of my model compared to not normalizing the data.
Now that I have constructed the model, I want to make a prediction. But how would I enter the predictor variables to make the prediction, given that the model has been trained on normalized data?
I want to call KNN.predict([[30,2000,40]]) to carry out a prediction where Age = 30, Distance = 2000, and Allowance = 40. But as the data has been normalized, I can't think of a way to do this. I used the following code to normalize the data:
from sklearn import preprocessing

X = preprocessing.StandardScaler().fit(X).transform(X.astype(float))
Actually, the answer is buried in the code you provided!
Once you fit the instance of preprocessing.StandardScaler(), it remembers how to scale data. Try this:
scaler = preprocessing.StandardScaler().fit(X)
# scaler is an object that knows how to normalize data points
X_normalized = scaler.transform(X.astype(float))
# used the scaler to normalize the data points in X
# Note, this is what you have done, just in two steps.
# I just capture the scaler object
#
# ... Train your model on X_normalized
#
# Now predict
other_data = [[30,2000,40]]
other_data_normalized = scaler.transform(other_data)
KNN.predict(other_data_normalized)
Notice that I used scaler.transform twice in the same way.
See the documentation for StandardScaler.transform.
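As an alternative design, not from the original answer: a scikit-learn Pipeline bundles the scaler and the classifier, so predictions on raw inputs are normalized automatically. X and y here are assumed to be the question's training data.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

#fit() learns the scaling and the KNN model together;
#predict() then applies the same scaling to new points
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
pipe.fit(X.astype(float), y)
pipe.predict([[30, 2000, 40]])  # raw, unscaled inputs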
I am learning h2o model predictions. When I do:
data_frame = h2o.H2OFrame(python_obj=data[1:], column_names=data[0])
data_train, data_valid, data_test = data_frame.split_frame(ratios= config.trainer_analizer_ratios, seed=config.trainer_analizer_seed)
# H2OGeneralizedLinearEstimator
allLog += "\n Starting H2OGeneralizedLinearEstimator"
model_gle = h2o.estimators.H2OGeneralizedLinearEstimator()
model_gle.train(x=predictors, y=response, training_frame= data_train, validation_frame= data_valid)
print(model_gle)
perf_gle = model_gle.model_performance(test_data= data_test)
print("GLM performance:", perf_gle)
I get the following output:
** Reported on test data. **
MSE: 494.25950875189955
RMSE: 22.23194792976764
MAE: 17.380709221249717
RMSLE: 1.217426465475652
R^2: 0.04331665117221439
Mean Residual Deviance: 494.25950875189955
Null degrees of freedom: 1177
Residual degrees of freedom: 1174
Null deviance: 608812.1064795277
Residual deviance: 582237.7013097376
AIC: 10660.224689554776
Why don't I get the AUC metric? I need it to score different models.
The GLM algorithm thinks you are solving a regression problem; you need to specify that you are solving a classification problem. You can do this with the family parameter (please see the documentation for an example), and you may also need to convert your target to type enum using the asfactor() method.
For your convenience here is the example code snippet that the link points to:
import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
h2o.init()
# import the cars dataset:
# this dataset is used to classify whether or not a car is economical based on
# the car's displacement, power, weight, and acceleration, and the year it was made
cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
# convert response column to a factor
cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
# set the predictor names and the response column name
predictors = ["displacement","power","weight","acceleration","year"]
response = "economy_20mpg"
# split into train and validation sets
train, valid = cars.split_frame(ratios = [.8])
# try using the `family` parameter:
# Initialize and train a GLM
cars_glm = H2OGeneralizedLinearEstimator(family = 'binomial')
cars_glm.train(x = predictors, y = response, training_frame = train, validation_frame = valid)
# print the auc for the validation data
cars_glm.auc(valid = True)
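Applied to the question's own code, a hedged sketch (assuming the response column is binary and reusing the frames and config from the question):

# convert the target to a factor so h2o treats this as classification
data_frame[response] = data_frame[response].asfactor()
data_train, data_valid, data_test = data_frame.split_frame(ratios=config.trainer_analizer_ratios, seed=config.trainer_analizer_seed)

model_gle = h2o.estimators.H2OGeneralizedLinearEstimator(family='binomial')
model_gle.train(x=predictors, y=response, training_frame=data_train, validation_frame=data_valid)

perf_gle = model_gle.model_performance(test_data=data_test)
print("GLM AUC:", perf_gle.auc())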
I'm getting drastically different F1 scores with the same input data with scikit-learn and caret. Here's how I'm running a GBM model for each.
scikit-learn (using the built-in 'f1' scorer):
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

est = GradientBoostingClassifier(n_estimators=4000, learning_rate=0.1, max_depth=5, max_features='log2', random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(est, data, labels, scoring='f1', cv=cv, n_jobs=-1)  # cv must be passed by keyword
caret (F1 must be defined and called):
f1 <- function(data, lev = NULL, model = NULL) {
  f1_val <- F1_Score(y_pred = data$pred, y_true = data$obs, positive = lev[1])
  c("F1" = f1_val)
}
set.seed(0)
gbm <- train(label ~ .,
             data = data,
             method = "gbm",
             trControl = trainControl(method = "repeatedcv", number = 10, repeats = 3,
                                      summaryFunction = f1, classProbs = TRUE),
             metric = "F1",
             verbose = FALSE)
From the above code, I get an F1 score of ~0.8 using scikit-learn and ~0.25 using caret. A small difference might be attributed to algorithm differences, but I must be doing something wrong with the caret modeling to get the massive difference I'm seeing here. I'd prefer not to post my data set, so hopefully the issue can be diagnosed from the code. Any help would be much appreciated.
GBT is an ensemble of decision trees. The difference comes from:
The number of decision trees in the ensemble (n_estimators = 4000 vs. n.trees = 100).
The shape (breadth, depth) of individual decision trees (max_depth = 5 vs. interaction.depth = 1).
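For a concrete comparison, here is a hedged sketch of the scikit-learn side configured to match the gbm defaults cited above (data and labels are assumed to be the question's inputs):

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# match caret's gbm defaults: 100 trees of depth 1, shrinkage 0.1
est_matched = GradientBoostingClassifier(n_estimators=100, max_depth=1, learning_rate=0.1, random_state=0)
scores = cross_val_score(est_matched, data, labels, scoring='f1', cv=10)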
Currently, you're comparing the F1 score of a 100 MB GradientBoostingClassifier object with a 100 kB gbm object - one GBT model contains literally thousands of times more information than the other.
You may wish to export both models to the standardized PMML representation using sklearn2pmml and r2pmml packages, and look inside the resulting PMML files (plain text, so can be opened in any text editor) to better grasp their internal structure.
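For the scikit-learn side, a minimal export sketch, assuming the sklearn2pmml package is installed and est, data, labels as in the question:

from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

pipeline = PMMLPipeline([("classifier", est)])
pipeline.fit(data, labels)
sklearn2pmml(pipeline, "gbm_sklearn.pmml")  # plain-text PMML, inspectable in any editor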
In the program, I am scanning a number of brain samples taken as a time series of 40 x 64 x 64 images, one every 2.5 seconds. The number of 'voxels' (3D pixels) in each image is thus ~164,000 (40 * 64 * 64), each of which is a 'feature' for an image sample.
I thought of using Recursive Feature Elimination (RFE) and then following it up with Principal Component Analysis (PCA) to perform dimensionality reduction, because of the ridiculously high number of features.
There are 9 classes to predict, so this is a multi-class classification problem. Starting with RFE:
from sklearn.svm import SVC
from sklearn.feature_selection import RFE

estimator = SVC(kernel='linear')
rfe = RFE(estimator, n_features_to_select=20000, step=0.05)
rfe = rfe.fit(X_train, y_train)
X_best = rfe.transform(X_train)
Now perform PCA:
import numpy as np
from numpy.linalg import svd
from sklearn.preprocessing import scale
from sklearn.decomposition import PCA

X_best = scale(X_best)

def get_optimal_number_of_components():
    cov = np.dot(X_best, X_best.transpose()) / float(X_best.shape[0])
    U, s, v = svd(cov)
    print('Shape of S =', s.shape)
    S_nn = sum(s)
    for num_components in range(0, s.shape[0]):
        temp_s = s[0:num_components]
        S_ii = sum(temp_s)
        if (1 - S_ii / float(S_nn)) <= 0.01:
            return num_components
    return s.shape[0]

n_comp = get_optimal_number_of_components()
print('optimal number of components =', n_comp)

pca = PCA(n_components=n_comp)
pca = pca.fit(X_best)
X_pca_reduced = pca.transform(X_best)
Train an SVM on the reduced-component dataset:
svm = SVC(kernel='linear',C=1,gamma=0.0001)
svm = svm.fit(X_pca_reduced,y_train)
Now transform the test set with the RFE-PCA reduction and make the predictions:
X_test = scale(X_test)
X_rfe = rfe.transform(X_test)
X_pca = pca.transform(X_rfe)
predictions = svm.predict(X_pca)
print('predictions =', predictions)
print('actual =', y_test)
I trained it on a subset of my data and got 76.92% accuracy. I'm not too worried about the low number because the model was trained on only 1/12 of my dataset.
I tried doubling the training size and got 92% accuracy, which is pretty good. But then I trained on the entire dataset and saw an accuracy of 92.5%.
So I got a 0.5% increase in accuracy for a 6-fold increase in dataset size. Furthermore, the data samples aren't noisy, so nothing is wrong with the samples.
Also, for 1/12th of the dataset as training size, I get the same 76.92% when I choose n_features_to_select = 1000 while performing RFE (and the same for 20000!). There must be something wrong here. Why do I get the same performance when selecting so few features?
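A hedged diagnostic sketch, an assumption rather than part of the original post: compare the feature masks RFE selects at the two sizes to see how much they overlap.

import numpy as np
from sklearn.svm import SVC
from sklearn.feature_selection import RFE

rfe_small = RFE(SVC(kernel='linear'), n_features_to_select=1000, step=0.05).fit(X_train, y_train)
rfe_large = RFE(SVC(kernel='linear'), n_features_to_select=20000, step=0.05).fit(X_train, y_train)

# count the features picked by both runs
overlap = np.sum(rfe_small.support_ & rfe_large.support_)
print('features selected by both runs =', overlap)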