I want to build a regression model with 2 output nodes using TensorFlow. I found code that builds a regression model, but it has only 1 output node.
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/skflow/boston.py
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from sklearn import cross_validation
from sklearn import metrics
from sklearn import preprocessing
import tensorflow as tf
from tensorflow.contrib import learn
def main(unused_argv):
    # Load dataset
    boston = learn.datasets.load_dataset('boston')
    x, y = boston.data, boston.target

    # Split dataset into train / test
    x_train, x_test, y_train, y_test = cross_validation.train_test_split(
        x, y, test_size=0.2, random_state=42)

    # Scale data (training set) to 0 mean and unit standard deviation.
    scaler = preprocessing.StandardScaler()
    x_train = scaler.fit_transform(x_train)

    # Build 2-layer fully connected DNN with 10, 10 units respectively.
    feature_columns = learn.infer_real_valued_columns_from_input(x_train)
    regressor = learn.DNNRegressor(
        feature_columns=feature_columns, hidden_units=[10, 10])

    # Fit
    regressor.fit(x_train, y_train, steps=5000, batch_size=1)

    # Predict and score
    y_predicted = list(
        regressor.predict(scaler.transform(x_test), as_iterable=True))
    score = metrics.mean_squared_error(y_predicted, y_test)

    print('MSE: {0:f}'.format(score))


if __name__ == '__main__':
    tf.app.run()
I am new to TensorFlow, so I searched for code that works similarly to what I need, but this example's output is one-dimensional.
In my model the input is N*1000 and the output is N*2. Is there effective and efficient code for this kind of regression? Please give me an example.
Actually, I found workable code using DNNRegressor:
import numpy as np
from sklearn.cross_validation import train_test_split
from tensorflow.contrib import learn
import tensorflow as tf
import logging
#logging.getLogger().setLevel(logging.INFO)

# Some fake data
N = 200
X = np.array(range(N), dtype=np.float32) / (N / 10)
X = X[:, np.newaxis]
#Y = np.sin(X.squeeze()) + np.random.normal(0, 0.5, N)
Y = np.zeros([N, 2])
Y[:, 0] = X.squeeze()
Y[:, 1] = X.squeeze()**2

X_train, X_test, Y_train, Y_test = train_test_split(X, Y,
                                                    train_size=0.8,
                                                    test_size=0.2)

reg = learn.DNNRegressor(hidden_units=[10, 10])
reg.fit(X_train, Y_train[:, 0], steps=500)
But this code only works when the shape of Y_train is N*1; it fails when the shape of Y_train is N*2.
However, I want to build a regression model where the input is N*1000 and the output is N*2, and I can't fix it.
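For concreteness, here is a minimal sketch of the kind of two-output network I am after, wired by hand with low-level TensorFlow 1.x ops (the layer sizes, learning rate, and fake data are all illustrative assumptions on my part); I would like to achieve the same with an estimator like DNNRegressor:
import numpy as np
import tensorflow as tf

n_features, n_outputs = 1000, 2

# TF 1.x-style placeholders for a batch of inputs and 2-dimensional targets
x = tf.placeholder(tf.float32, [None, n_features])
y = tf.placeholder(tf.float32, [None, n_outputs])

# Two hidden layers of 10 units, then 2 linear output units
hidden = tf.layers.dense(x, 10, activation=tf.nn.relu)
hidden = tf.layers.dense(hidden, 10, activation=tf.nn.relu)
y_pred = tf.layers.dense(hidden, n_outputs)

# Mean squared error over both outputs
loss = tf.reduce_mean(tf.square(y_pred - y))
train_op = tf.train.AdamOptimizer(0.01).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Random fake data just to show the shapes involved
    X_fake = np.random.randn(200, n_features).astype(np.float32)
    Y_fake = np.random.randn(200, n_outputs).astype(np.float32)
    for step in range(500):
        sess.run(train_op, feed_dict={x: X_fake, y: Y_fake})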
Related
I haven't been able to find any information on whether or not StackingCVClassifier accepts pre-trained models.
Probably not. StackingCVClassifier and StackingClassifier currently take a list of base estimators, then call fit and predict on them.
It's pretty straightforward to implement this yourself, though. The main idea behind stacking is to fit a "final model" using the predictions of earlier models.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
X, y = make_regression(n_samples=1000)
X_train, X_test, y_train, y_test = train_test_split(X, y)
Here X_train is (750, 100) and X_test is (250, 100).
We'll emulate three "pre-trained" models fit on X_train, y_train, and collect their predictions on both the training set and the test set:
from sklearn.linear_model import RidgeCV, LassoCV
from sklearn.neighbors import KNeighborsRegressor
# Emulate "pre-trained" models
models = [RidgeCV(), LassoCV(), KNeighborsRegressor(n_neighbors=5)]
X_train_new = np.zeros((X_train.shape[0], len(models))) # (750, 3)
X_test_new = np.zeros((X_test.shape[0], len(models))) # (250, 3)
for i, model in enumerate(models):
model.fit(X_train, y_train)
X_train_new[:, i] = model.predict(X_train)
X_test_new[:, i] = model.predict(X_test)
The final model is fit on X_train_new and can make predictions using (N, 3) matrices produced by our base models:
from sklearn.ensemble import GradientBoostingRegressor
clf = GradientBoostingRegressor()
clf.fit(X_train_new, y_train)
clf.score(X_test_new, y_test)
# 0.9998247
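Note that with this shortcut the base models predict on the same rows they were fit on, so the final model trains on slightly optimistic inputs; the "CV" in StackingCVClassifier exists precisely to generate out-of-fold predictions and avoid that leakage.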
# split train test data
from sklearn.model_selection import train_test_split
X_train, y_train, X_test, y_test = train_test_split(x, y, test_size=0.2, random_state=1)
# import required modules and train the ML algorithm
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
I am getting the error:
Found input variables with inconsistent numbers of samples: [28, 7]
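Note that train_test_split returns the splits in the order X_train, X_test, y_train, y_test; the unpacking above swaps the middle two, so a 28-row X_train gets paired with a 7-row y_train (an 80/20 split of 35 samples), which matches the mismatch in the error. A minimal fix:
from sklearn.model_selection import train_test_split

# Features first, then labels -- note the corrected unpacking order
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)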
For the salary CSV data, I will take this one as an example:
https://www.kaggle.com/karthickveerakumar/salary-data-simple-linear-regression
from sklearn.model_selection import train_test_split
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
#Import the dataset
salary_data=pd.read_csv("/mnt/c/Users/XXXXXXX/Downloads/Salary_Data.csv")
# Here you can split your dataset between train and test using 80% for train
X_train, X_test, y_train, y_test = train_test_split(salary_data["YearsExperience"], salary_data["Salary"], test_size=0.2, random_state=1)
#Then you can fit your linear model on train dataset
#Here the goal is to modelize salary considering years of XP
regressor = LinearRegression()
model = regressor.fit(X_train.values.reshape(-1, 1),y_train.values.reshape(-1, 1))
#Let's plot our model prediction on whole data and compare to real data
plt.title("Salary/Years of XP")
plt.ylabel("Salary $")
plt.xlabel("Years")
plt.plot(salary_data["YearsExperience"],salary_data["Salary"],color="blue",label="real data")
plt.plot(salary_data["YearsExperience"],model.predict(salary_data["YearsExperience"].values.reshape(-1,1)),color="red",label="linear model")
plt.legend()
plt.show()
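As a quick usage follow-up (the 5.0 below is an arbitrary value of mine, not from the dataset), the fitted model can also produce point predictions:
import numpy as np

# Predicted salary for a hypothetical 5 years of experience
print(model.predict(np.array([[5.0]])))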
I'm completely unaware as to why I'm receiving this error. I am trying to implement XGBoost, but it fails with "ValueError: For a sparse output, all columns should be a numeric or convertible to a numeric", even after I've one-hot encoded my categorical data. If anyone knows what is causing this and a possible solution, I'd greatly appreciate it. Here is my code, written in Python:
# Artificial Neural Networks - With XGBoost
# PRE PROCESS
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Churn_Modelling.csv')
X = dataset.iloc[:, 3:13].values
y = dataset.iloc[:, 13].values
# Encoding Categorical Data
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer([('encoder', OneHotEncoder(), [1, 2])],
                       remainder='passthrough')
X = np.array(ct.fit_transform(X), dtype=np.float)
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state = 0)
# Fitting XGBoost to the training set
from xgboost import XGBClassifier
classifier = XGBClassifier()
classifier.fit(x_train, y_train)
# Predicting the Test set Results
y_pred = classifier.predict(x_test)
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
# Applying k-Fold Cross Validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator=classifier, X=x_train, y=y_train, cv=10)
accuracies.mean()
accuracies.std()
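For what it's worth, one plausible fix (an assumption on my part, not verified against this exact dataset): the ValueError is raised when ColumnTransformer tries to sparse-stack the object-dtype passthrough columns, and forcing dense output sidesteps it. A hedged sketch, reusing the imports and X from the code above:
# sparse_threshold=0 forces dense output, so ColumnTransformer never
# attempts the sparse stacking that raises this ValueError (assumption)
ct = ColumnTransformer([('encoder', OneHotEncoder(), [1, 2])],
                       remainder='passthrough',
                       sparse_threshold=0)
X = ct.fit_transform(X).astype(float)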
I have a highly imbalanced dataset and would like to perform SMOTE to balance it, then cross-validation to measure the accuracy. However, most of the existing tutorials use only a single training/testing split when applying SMOTE.
Therefore, I would like to know the correct procedure for performing SMOTE with cross-validation.
My current code is as follows. However, as mentioned above, it only uses a single split.
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

sm = SMOTE(random_state=2)
X_train_res, y_train_res = sm.fit_sample(X_train, y_train.ravel())

clf_rf = RandomForestClassifier(n_estimators=25, random_state=12)
clf_rf.fit(X_train_res, y_train_res)
I am happy to provide more details if needed.
You need to perform SMOTE within each fold. Accordingly, you need to avoid train_test_split in favour of KFold:
from sklearn.model_selection import KFold
from imblearn.over_sampling import SMOTE
from sklearn.metrics import f1_score

kf = KFold(n_splits=5)

for fold, (train_index, test_index) in enumerate(kf.split(X), 1):
    X_train = X[train_index]
    y_train = y[train_index]  # Based on your code, you might need a ravel call here, but I would look into how you're generating your y
    X_test = X[test_index]
    y_test = y[test_index]  # See comment on ravel and y_train
    sm = SMOTE()
    X_train_oversampled, y_train_oversampled = sm.fit_sample(X_train, y_train)
    model = ...  # Choose a model here
    model.fit(X_train_oversampled, y_train_oversampled)
    y_pred = model.predict(X_test)

    print(f'For fold {fold}:')
    print(f'Accuracy: {model.score(X_test, y_test)}')
    print(f'f-score: {f1_score(y_test, y_pred)}')
You can also, for example, append the scores to a list defined outside.
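For instance, a small self-contained sketch (the RandomForestClassifier is just a placeholder choice of mine; X and y are assumed to be numpy arrays as above) that collects per-fold scores in lists defined outside the loop and averages them afterwards:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold
from imblearn.over_sampling import SMOTE

accuracies, f_scores = [], []  # defined outside the loop
kf = KFold(n_splits=5)
for train_index, test_index in kf.split(X):
    # Oversample the training fold only
    X_res, y_res = SMOTE().fit_sample(X[train_index], y[train_index])  # fit_resample in newer imblearn
    model = RandomForestClassifier().fit(X_res, y_res)
    y_pred = model.predict(X[test_index])
    accuracies.append(model.score(X[test_index], y[test_index]))
    f_scores.append(f1_score(y[test_index], y_pred))

print(f'Mean accuracy: {np.mean(accuracies):.3f}')
print(f'Mean f-score: {np.mean(f_scores):.3f}')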
from sklearn.model_selection import StratifiedKFold
from imblearn.over_sampling import SMOTE

cv = StratifiedKFold(n_splits=5)
for train_idx, test_idx in cv.split(X, y):
    X_train, y_train = X[train_idx], y[train_idx]
    X_test, y_test = X[test_idx], y[test_idx]
    X_train, y_train = SMOTE().fit_sample(X_train, y_train)
    ....
I think you can also solve this with a pipeline from the imbalanced-learn library.
I saw this solution on a blog called Machine Learning Mastery: https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/
The idea is to use a pipeline from imblearn to do the cross-validation. Please let me know if that works. The example below uses a decision tree, but the logic is the same.
# decision tree evaluated on imbalanced dataset with SMOTE oversampling
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE

# define dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.99], flip_y=0,
                           random_state=1)
# define pipeline
steps = [('over', SMOTE()), ('model', DecisionTreeClassifier())]
pipeline = Pipeline(steps=steps)
# evaluate pipeline
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(pipeline, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
score = mean(scores)
print('Mean ROC AUC: %.3f' % score)
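The key property of imblearn's Pipeline here is that SMOTE is applied only when fitting, i.e. only to the training folds inside cross_val_score; each held-out fold is scored on untouched data, which is exactly what makes this cross-validation sound compared with oversampling before splitting.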
I'm a beginner in machine learning and I want to build a model to predict the price of houses. I prepared a dataset by crawling a local housing website; it consists of 1000 samples and only 4 features (latitude, longitude, area and number of rooms).
I tried the RandomForestRegressor and LinearSVR models in sklearn, but I can't train the model properly and the MSE is super high.
The MSE is almost 90,000,000 (the true prices range between 5,000,000 and 900,000,000).
Here is my code:
import numpy as np
import pandas as pd
from sklearn.svm import LinearSVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, cross_val_score

df = pd.read_csv('dataset.csv', index_col=False)
X = df.drop('price', axis=1)
X_data = X.values
Y_data = df.price.values

X_train, X_test, Y_train, Y_test = train_test_split(X_data, Y_data, test_size=0.2, random_state=5)

rgr = RandomForestRegressor(n_estimators=100)
svr = LinearSVR()

rgr.fit(X_train, Y_train)
svr.fit(X_train, Y_train)

MSEs = cross_val_score(estimator=rgr,
                       X=X_train,
                       y=Y_train,
                       scoring='neg_mean_squared_error',
                       cv=5)

MSEsSVR = cross_val_score(estimator=svr,
                          X=X_train,
                          y=Y_train,
                          scoring='neg_mean_squared_error',
                          cv=5)

MSEs *= -1
RMSEs = np.sqrt(MSEs)

print("Root mean squared error with 95% confidence interval:")
print("{:.3f} (+/- {:.3f})".format(RMSEs.mean(), RMSEs.std() * 2))
print("")
Is the problem with my dataset and the number of features? How can I build a prediction model with this kind of dataset?
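One thing worth trying, offered as a hedged suggestion rather than a diagnosis: with targets spanning 5,000,000 to 900,000,000, squared error is dominated by the most expensive houses, and fitting on a log-transformed target often helps. A minimal sketch using sklearn's TransformedTargetRegressor, assuming X_train/Y_train/X_test/Y_test from the code above:
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import RandomForestRegressor

# Fit on log(price); the wrapper applies exp automatically at predict time
log_rgr = TransformedTargetRegressor(
    regressor=RandomForestRegressor(n_estimators=100),
    func=np.log, inverse_func=np.exp)
log_rgr.fit(X_train, Y_train)
print(log_rgr.score(X_test, Y_test))  # R^2 on the original price scale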