I have monthly temperature data from several stations in eastern siberia. However, the one station which is necessary for my work is missing a lot of data, while other stations in the vicinity have good coverage. Is there a way to interpolate missing data based on the behavior of another dataset? Can't provide any code, since I don't know where to start and the datasets look like this:
The red dots is the data from the station with missing values while the green graph is from a station with good coverage
I would appreciate if anyone could point me in the right direction
There are methods to do this, for instance, apply a FFT on the dataset with good coverage and see how well it fits your dataset with poor coverage while removing high-frequency terms.
However, I highly doubt that this will be any useful: your dataset with high coverage fits almost perfectly your dataset with poor coverage. Whatever is the method you want to apply, the best function that resembles your dataset with high coverage while fitting your dataset with poor coverage, is the dataset with high coverage itself.
Let's create a trial dataset to work on your problem:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
t = np.linspace(0, 30*2*np.pi, 30*24*2)
td = pd.date_range("2020-01-01", freq='30T', periods=t.size)
T0 = np.sin(t)*8 - 15 + np.random.randn(t.size)*0.2
T1 = np.sin(t)*7 - 13 + np.random.randn(t.size)*0.1
T2 = np.sin(t)*9 - 10 + np.random.randn(t.size)*0.3
T3 = np.sin(t)*8.5 - 11 + np.random.randn(t.size)*0.5
T = np.vstack([T0, T1, T2, T3]).T
features = pd.DataFrame(T, columns=["s1", "s2", "s3", "s4"], index=td)
It looks like:
axe = features[:"2020-01-04"].plot()
axe.legend()
axe.grid()
Then if your time series linearly correlates well, you can simply predict missing values by the mean of Ordinary Least Square regression. SciKit-Learn provides a convenient interface to perform this kind of computations:
from sklearn import linear_model
from sklearn.model_selection import train_test_split
# Remove target site from features:
target = features.pop("s4")
# Split dataset into train (actual data) and test (missing temperatures):
x_train, x_test, y_train, y_test = train_test_split(features, target, train_size=0.25, random_state=123)
# Create a Linear Regressor and train it:
reg = linear_model.LinearRegression()
reg.fit(x_train, y_train)
# Assess regression score with test data:
reg.score(x_test, y_test) # 0.9926150729585087
# Predict missing values:
ypred = reg.predict(x_test)
ypred = pd.DataFrame(ypred, index=x_test.index, columns=["s4p"])
The result looks like:
axe = features[:"2020-01-04"].plot()
target[:"2020-01-04"].plot(ax=axe)
ypred[:"2020-01-04"].plot(ax=axe, linestyle='None', marker='.')
axe.legend()
axe.grid()
error = (y_test - ypred.squeeze())
axe = error.plot()
axe.legend(["Prediction Error"])
axe.grid()
Related
I am doing a project and trying to show some BASIC elements of scikit in python. My goal is to create a 3ish simple examples and show how it learns and predicts. I am applying a simple sine wave type pattern and have been playing with a good example online from
https://mclguide.readthedocs.io/en/latest/sklearn/regression.html
My problem is that since I am new to this library and ML in general, I don't understand what I have in front of me and how to transform it into the output I am going for. The two problems I am struggling with is a linear regression on a sine wave and a guassian regression on a more complicated wave. The output I am getting per the article is the accuracy and that works like intended but what I am trying to get to is how to plot the predicted output on top of (or as an extension) of the training data to visually show how it did. I think the data is in here, I am either just using the wrong methods to return the appropriate information or I am not understanding how to extract the information from what is already being returned.
Here are some additional questions
I do not completely understand the "features = x[:, np.newaxis]" line
When plotting, what does '-*' and '-o'do? I looked through the documentation and it appears to be formatting but I couldn't find these two examples exactly.
What do I need to do to get access to the 20% predicted values so that I can plot it against the original?
Is there a simple way to apply the most amount of this code to apply to simple and gaussian examples?
Here is the skeletal code. Most of the scikit from the article is unchanged.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import random
from operator import add
N = 200 # 10 samples
randomlist = []
x = np.linspace(0, 12, N)
sine_wave = np.sin(1*x)
#plot the source data
plt.figure(figsize=(20,5))
plt.plot(x, sum_vector, 'o');
plt.show()
# convert features in 2D format i.e. list of list
# print('Before: ', x.shape)
features = x[:, np.newaxis]
# print('After: ', features.shape)
# save sine wave in variable 'targets'
targets = sine_wave
# split the training and test data
train_features, test_features, train_targets, test_targets = train_test_split(
features, targets,
train_size=0.8,
test_size=0.2,
# random but same for all run, also accuracy depends on the
# selection of data e.g. if we put 10 then accuracy will be 1.0
# in this example
random_state=23,
# keep same proportion of 'target' in test and target data
# stratify=targets # can not used for single feature
)
# training using 'training data'
regressor = LinearRegression()
regressor.fit(train_features, train_targets) # fit the model for training data
# predict the 'target' for 'training data'
prediction_training_targets = regressor.predict(train_features)
# note that 'score' uses 'feature and target (not predict_target)'
# for scoring in Regression
# whereas 'accuracy_score' uses 'features and predict_targets'
# for scoring in Classification
self_accuracy = regressor.score(train_features, train_targets)
print("Accuracy for training data (self accuracy):", self_accuracy)
# predict the 'target' for 'test data'
prediction_test_targets = regressor.predict(test_features)
test_accuracy = regressor.score(test_features, test_targets)
print("Accuracy for test data:", test_accuracy)
# plot the predicted and actual target for test data
plt.figure(figsize=(20,5))
plt.plot(test_targets, color = "red")
plt.show()
plt.plot(prediction_test_targets, '-*', color = "red")
plt.plot(test_targets, '-o' )
plt.show()
I was training a model that contains 8 features that allows us to predict the probability of a room been sold.
Region: The region the room belongs to (an integer, taking value between 1 and 10)
Date:The date of stay (an integer between 1‐365, here we consider only one‐day
request)
Weekday: Day of week (an integer between 1‐7)
Apartment: Whether the room is a whole apartment (1) or just a room (0)
#beds:The number of beds in the room (an integer between 1‐4)
Review: Average review of the seller (a continuous variable between 1 and 5)
Pic Quality: Quality of the picture of the room (a continuous variable between 0 and 1)
Price: he historic posted price of the room (a continuous variable)
Accept:Whether this post gets accepted (someone took it, 1) or not (0) in the end
Column Accept is the "y". Hence, this is a binary classification.
We have plot the data and some of the data were skewed so we applied power transform.
We tried a neural network, ExtraTrees, XGBoost, Gradient boost, Random forest. They all gave about 0.77 AUC. However, when we tried them on the test set, the AUC dropped to 0.55 with a precision of 27%.
I am not sure where when wrong but my thinking was that the reason may due to the mixing of discrete and continuous data. Especially some of them are either 0 or 1.
Can anyone help?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
import warnings
warnings.filterwarnings('ignore')
df_train = pd.read_csv('case2_training.csv')
X, y = df_train.iloc[:, 1:-1], df_train.iloc[:, -1]
y = y.astype(np.float32)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
from sklearn.preprocessing import PowerTransformer
pt = PowerTransformer()
transform_list = ['Pic Quality', 'Review', 'Price']
X_train[transform_list] = pt.fit_transform(X_train[transform_list])
X_test[transform_list] = pt.transform(X_test[transform_list])
for i in transform_list:
df = X_train[i]
ax = df.plot.hist()
ax.set_title(i)
plt.show()
# Normalization
sc = MinMaxScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
X_train = X_train.astype(np.float32)
X_test = X_test.astype(np.float32)
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=123, n_estimators=50)
clf.fit(X_train,y_train)
yhat = clf.predict_proba(X_test)
# AUC metric
train_accuracy = roc_auc_score(y_test, yhat[:,-1])
print("AUC",train_accuracy)
from sklearn.ensemble import GradientBoostingClassifier
clf = GradientBoostingClassifier(random_state=123, n_estimators=50)
clf.fit(X_train,y_train)
yhat = clf.predict_proba(X_test)
# AUC metric
train_accuracy = roc_auc_score(y_test, yhat[:,-1])
print("AUC",train_accuracy)
from torch import nn
from skorch import NeuralNetBinaryClassifier
import torch
model = nn.Sequential(
nn.Linear(8,64),
nn.BatchNorm1d(64),
nn.GELU(),
nn.Linear(64,32),
nn.BatchNorm1d(32),
nn.GELU(),
nn.Linear(32,16),
nn.BatchNorm1d(16),
nn.GELU(),
nn.Linear(16,1),
# nn.Sigmoid()
)
net = NeuralNetBinaryClassifier(
model,
max_epochs=100,
lr=0.1,
# Shuffle training data on each epoch
optimizer=torch.optim.Adam,
iterator_train__shuffle=True,
)
net.fit(X_train, y_train)
from xgboost.sklearn import XGBClassifier
clf = XGBClassifier(silent=0,
learning_rate=0.01,
min_child_weight=1,
max_depth=6,
objective='binary:logistic',
n_estimators=500,
seed=1000)
clf.fit(X_train,y_train)
yhat = clf.predict_proba(X_test)
# AUC metric
train_accuracy = roc_auc_score(y_test, yhat[:,-1])
print("AUC",train_accuracy)
Here is an attachment of a screenshot of the data.
Sample data
This is the fundamental first step of Data Analytics. You need to do two things here:
Data understanding - do the data fields in their current format make sense (data types, value range etc.)
Data preparation - what should I do to update these data fields before passing them to our model? Also which inputs do you think will be useful for your model and which will provide little benefit? Are there outliers I need to consider/handle?
A good book if you're starting in the field of data analytics is Fundamentals of Machine Learning for Predictive Data Analytics (I have no affiliation with this book).
Looking at your dataset there's a couple of things you could try to see how it influences your prediction results:
Unless region order is actually ranked in importance/value I would change this to a one hot encoded feature, you can do this in sklearn. Otherwise you run the risk of your model thinking that regions with a higher number (say 10) are more important than regions with a lower value (say 1).
You could attempt to normalise certain categories if they are much larger than some of your other data fields Why Data Normalization is necessary for Machine Learning models
Consider looking at the Kaggle competition House Prices: Advanced Regression Techniques. It's doing a similar thing to what you're attempting to do, and it might have some pointers for how you should approach the problem in the Notebooks and Discussion tabs.
Without deeply exploring all the data you are using it is hard to say for certain what is causing the drop in accuracy (or AUC) when moving from your training set to the testing set. It is unlikely to be caused by the mixed discrete/continuous data.
The drop just suggests that your models are over-fitting to your training data (and therefore not transferring well). This could be caused by too many learned parameters (given the amount of data you have)--more often a problem with neural networks than with some of the other methods you mentioned. Or, the problem could be with the way the data was split into training/testing. If the distribution of the data has a significant difference (that's maybe not obvious) then you wouldn't expect the testing performance to be as good. If it were me, I'd look carefully at how the data was split into training/testing (assuming you have a reasonably large set of data). You may try repeating your experiments with a number of random training/testing splits (search k-fold cross validation if you're not familiar with it).
your model is overfit. try to make a simple model first and use a lower parameter value. for tree-based classification, scaling does not have any impact on the model.
I've been learning some of the core concepts of ML lately and writing code using the Sklearn library. After some basic practice, I tried my hand at the AirBnb NYC dataset from kaggle (which has around 40000 samples) - https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data#New_York_City_.png
I tried to make a model that could predict the price of a room/apt given the various features of the dataset. I realised that this was a regression problem and using this sklearn cheat-sheet, I started trying the various regression models.
I used the sklearn.linear_model.Ridge as my baseline and after doing some basic data cleaning, I got an abysmal R^2 score of 0.12 on my test set. Then I thought, maybe the linear model is too simplistic so I tried the 'kernel trick' method adapted for regression (sklearn.kernel_ridge.Kernel_Ridge) but they would take too much time to fit (>1hr)! To counter that, I used the sklearn.kernel_approximation.Nystroem function to approximate the kernel map, applied the transformation to the features prior to training and then used a simple linear regression model. However, even that took a lot of time to transform and fit if I increased the n_components parameter which I had to to get any meaningful increase in the accuracy.
So I am thinking now, what happens when you want to do regression on a huge dataset? The kernel trick is extremely computationally expensive while the linear regression models are too simplistic as real data is seldom linear. So are neural nets the only answer or is there some clever solution that I am missing?
P.S. I am just starting on Overflow so please let me know what I can do to make my question better!
This is a great question but as it often happens there is no simple answer to complex problems. Regression is not a simple as it is often presented. It involves a number of assumptions and is not limited to linear least squares models. It takes couple university courses to fully understand it. Below I'll write a quick (and far from complete) memo about regressions:
Nothing will replace proper analysis. This might involve expert interviews to understand limits of your dataset.
Your model (any model, not limited to regressions) is only as good as your features. If home price depends on local tax rate or school rating, even a perfect model would not perform well without these features.
Some features cannot be included in the model by design, so never expect a perfect score in real world. For example, it is practically impossible to account for access to grocery stores, eateries, clubs etc. Many of these features are also moving targets, as they tend to change over time. Even 0.12 R2 might be great if human experts perform worse.
Models have their assumptions. Linear regression expects that dependent variable (price) is linearly related to independent ones (e.g. property size). By exploring residuals you can observe some non-linearities and cover them with non-linear features. However, some patterns are hard to spot, while still addressable by other models, like non-parametric regressions and neural networks.
So, why people still use (linear) regression?
it is the simplest and fastest model. There are a lot of implications for real-time systems and statistical analysis, so it does matter
often it is used as a baseline model. Before trying a fancy neural network architecture, it would be helpful to know how much we improve comparing to a naive method.
sometimes regressions are used to test certain assumptions, e.g. linearity of effects and relations between variables
To summarize, regression is definitely not the ultimate tool in most cases, but this is usually the cheapest solution to try first
UPD, to illustrate the point about non-linearity.
After building a regression you calculate residuals, i.e. regression error predicted_value - true_value. Then, for each feature you make a scatter plot, where horizontal axis is feature value and vertical axis is the error value. Ideally, residuals have normal distribution and do not depend on the feature value. Basically, errors are more often small than large, and similar across the plot.
This is how it should look:
This is still normal - it only reflects the difference in density of your samples, but errors have the same distribution:
This is an example of nonlinearity (a periodic pattern, add sin(x+b) as a feature):
Another example of non-linearity (adding squared feature should help):
The above two examples can be described as different residuals mean depending on feature value. Other problems include but not limited to:
different variance depending on feature value
non-normal distribution of residuals (error is either +1 or -1, clusters, etc)
Some of the pictures above are taken from here:
http://www.contrib.andrew.cmu.edu/~achoulde/94842/homework/regression_diagnostics.html
This is an great read on regression diagnostics for beginners.
I'll take a stab at this one. Look at my notes/comments embedded in the code. Keep in mind, this is just a few ideas that I tested. There are all kinds of other things you can try (get more data, test different models, etc.)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#%matplotlib inline
import sklearn
from sklearn.linear_model import RidgeCV, LassoCV, Ridge, Lasso
from sklearn.datasets import load_boston
#boston = load_boston()
# Predicting Continuous Target Variables with Regression Analysis
df = pd.read_csv('C:\\your_path_here\\AB_NYC_2019.csv')
df
# get only 2 fields and convert non-numerics to numerics
df_new = df[['neighbourhood']]
df_new = pd.get_dummies(df_new)
# print(df_new.columns.values)
# df_new.shape
# df.shape
# let's use a feature selection technique so we can see which features (independent variables) have the highest statistical influence on the target (dependent variable).
from sklearn.ensemble import RandomForestClassifier
features = df_new.columns.values
clf = RandomForestClassifier()
clf.fit(df_new[features], df['price'])
# from the calculated importances, order them from most to least important
# and make a barplot so we can visualize what is/isn't important
importances = clf.feature_importances_
sorted_idx = np.argsort(importances)
# what kind of object is this
# type(sorted_idx)
padding = np.arange(len(features)) + 0.5
plt.barh(padding, importances[sorted_idx], align='center')
plt.yticks(padding, features[sorted_idx])
plt.xlabel("Relative Importance")
plt.title("Variable Importance")
plt.show()
X = df_new[features]
y = df['price']
reg = LassoCV()
reg.fit(X, y)
print("Best alpha using built-in LassoCV: %f" % reg.alpha_)
print("Best score using built-in LassoCV: %f" %reg.score(X,y))
coef = pd.Series(reg.coef_, index = X.columns)
print("Lasso picked " + str(sum(coef != 0)) + " variables and eliminated the other " + str(sum(coef == 0)) + " variables")
Result:
Best alpha using built-in LassoCV: 0.040582
Best score using built-in LassoCV: 0.103947
Lasso picked 78 variables and eliminated the other 146 variables
Next step...
imp_coef = coef.sort_values()
import matplotlib
matplotlib.rcParams['figure.figsize'] = (8.0, 10.0)
imp_coef.plot(kind = "barh")
plt.title("Feature importance using Lasso Model")
# get the top 25; plotting fewer features so we can actually read the chart
type(imp_coef)
imp_coef = imp_coef.tail(25)
matplotlib.rcParams['figure.figsize'] = (8.0, 10.0)
imp_coef.plot(kind = "barh")
plt.title("Feature importance using Lasso Model")
X = df_new
y = df['price']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 10)
# Training the Model
# We will now train our model using the LinearRegression function from the sklearn library.
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train, y_train)
# Prediction
# We will now make prediction on the test data using the LinearRegression function and plot a scatterplot between the test data and the predicted value.
prediction = lm.predict(X_test)
plt.scatter(y_test, prediction)
from sklearn import metrics
from sklearn.metrics import r2_score
print('MAE', metrics.mean_absolute_error(y_test, prediction))
print('MSE', metrics.mean_squared_error(y_test, prediction))
print('RMSE', np.sqrt(metrics.mean_squared_error(y_test, prediction)))
print('R squared error', r2_score(y_test, prediction))
Result:
MAE 1004799260.0756996
MSE 9.87308783180938e+21
RMSE 99363412943.64531
R squared error -2.603867717517002e+17
This is horrible! Well, we know this doesn't work. Let's try something else. We still need to rowk with numeric data so let's try lng and lat coordinates.
X = df[['longitude','latitude']]
y = df['price']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 10)
# Training the Model
# We will now train our model using the LinearRegression function from the sklearn library.
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train, y_train)
# Prediction
# We will now make prediction on the test data using the LinearRegression function and plot a scatterplot between the test data and the predicted value.
prediction = lm.predict(X_test)
plt.scatter(y_test, prediction)
df1 = pd.DataFrame({'Actual': y_test, 'Predicted':prediction})
df2 = df1.head(10)
df2
df2.plot(kind = 'bar')
from sklearn import metrics
from sklearn.metrics import r2_score
print('MAE', metrics.mean_absolute_error(y_test, prediction))
print('MSE', metrics.mean_squared_error(y_test, prediction))
print('RMSE', np.sqrt(metrics.mean_squared_error(y_test, prediction)))
print('R squared error', r2_score(y_test, prediction))
# better but not awesome
Result:
MAE 85.35438165291622
MSE 36552.6244271195
RMSE 191.18740655994972
R squared error 0.03598346983552425
Let's look at OLS:
import statsmodels.api as sm
model = sm.OLS(y, X).fit()
# run the model and interpret the predictions
predictions = model.predict(X)
# Print out the statistics
model.summary()
I would hypothesize the following:
One hot encoding is doing exactly what it is supposed to do, but it is not helping you get the results you want. Using lng/lat, is performing slightly better, but this too, is not helping you achieve the results you want. As you know, you must work with numeric data for a regression problem, but none of the features is helping you to predict price, at least not very well. Of course, I could have made a mistake somewhere. If I did make a mistake, please let me know!
Check out the links below for a good example of using various features to predict housing prices. Notice: all variables are numeric, and the results are pretty decent (just around 70%, give or take, but still much better than what we're seeing with the Air BNB data set).
https://bigdata-madesimple.com/how-to-run-linear-regression-in-python-scikit-learn/
https://towardsdatascience.com/linear-regression-on-boston-housing-dataset-f409b7e4a155
It's an old problem about prediction using regression exploring Gapminder data. They used "prediction space" to compute prediction.
Q1. Why should I be creating "prediction space"? What is the use of it?
Q2. The relation of computing predictions over the "prediction space"?
import numpy as np
import pandas as pd
# Read the CSV file into a DataFrame: df
df = pd.read_csv('gapminder.csv')
The data seems like this;
Country,Year,life,population,income,region
Afghanistan,1800,28.211,3280000,603.0,South Asia
Slovak Republic,1960,70.47800000000001,4137224,8693.0,Europe & Central Asia
# Create arrays for features and target variable
y = df.life.values
X = df.fertility.values
# Reshape X and y
y = y.reshape(-1,1)
X = X.reshape(-1,1)
# Create the regressor: reg
reg = LinearRegression()
# Create the prediction space
prediction_space = np.linspace(min(X_fertility), max(X_fertility)).reshape(-1,1)
# Fit the model to the data
reg.fit(X_fertility, y)
# Compute predictions over the prediction space: y_pred
y_pred = reg.predict(prediction_space)
I believe that you are taking a course from DataCamp
I stumbled upon this too, and the answer is prediction_space and y_pred are used to construct the straight line in the graph
NOTE: for those who are reading this and don't understand what I'm talking about, the code snippet is actually missing the graph plotting code
# Plot regression line
plt.plot(prediction_space, y_pred, color='black', linewidth=3)
plt.show()
It comes with the y_pred to make a baseline for you to calculate the residuals and further get the R^2 value.
This is the data set from Kaggle's Titanic competition (train and test csv files). Each file has features of passengers such as ID, sex, age, etc. The train file has a "survived" column with 0 and 1 values. The test file is missing the survived column as it has to be predicted.
This is my simple code using random forest to give me a benchmark for the starter:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
import random
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_curve, auc
train=pd.read_csv('train.csv')
test=pd.read_csv('test.csv')
train['Type']='Train' #Create a flag for Train and Test Data set
test['Type']='Test'
fullData = pd.concat([train,test],axis=0) #Combined both Train and Test Data set
ID_col = ['PassengerId']
target_col = ["Survived"]
cat_cols = ['Name','Ticket','Sex','Cabin','Embarked']
num_cols= ['Pclass','Age','SibSp','Parch','Fare']
other_col=['Type'] #Test and Train Data set identifier
num_cat_cols = num_cols+cat_cols # Combined numerical and Categorical variables
for var in num_cat_cols:
if fullData[var].isnull().any()==True:
fullData[var+'_NA']=fullData[var].isnull()*1
#Impute numerical missing values with mean
fullData[num_cols] = fullData[num_cols].fillna(fullData[num_cols].mean(),inplace=True)
#Impute categorical missing values with -9999
fullData[cat_cols] = fullData[cat_cols].fillna(value = -9999)
#create label encoders for categorical features
for var in cat_cols:
number = LabelEncoder()
fullData[var] = number.fit_transform(fullData[var].astype('str'))
train=fullData[fullData['Type']=='Train']
test=fullData[fullData['Type']=='Test']
train['is_train'] = np.random.uniform(0, 1, len(train)) <= .75
Train, Validate = train[train['is_train']==True], train[train['is_train']==False]
features=list(set(list(fullData.columns))-set(ID_col)-set(target_col)-set(other_col))
x_train = Train[list(features)].values
y_train = Train["Survived"].values
x_validate = Validate[list(features)].values
y_validate = Validate["Survived"].values
x_test=test[list(features)].values
Train[list(features)]
#*************************
from sklearn import tree
random.seed(100)
rf = RandomForestClassifier(n_estimators=1000)
rf.fit(x_train, y_train)
status = rf.predict_proba(x_validate)
fpr, tpr, _ = roc_curve(y_validate, status[:,1]) #metrics. added by me
roc_auc = auc(fpr, tpr)
print(roc_auc)
final_status = rf.predict_proba(x_test)
test["Survived2"]=final_status[:,1]
test['my prediction']=np.where(test.Survived2 > 0.6, 1, 0)
test
As you can see, the final_status gives the probability of survival. I'm wondering how to get yes/no (1 or 0) answers from it. The easiest thing that I could think of was to say if probability is greater than 0.6 then the person survived and otherwise died ('my prediction' column) but once I submit the results, the predictions are not good at all.
I appreciate any insights. Thanks
Transforming your probability into binary output is the right way to go, but why did you choose > .6 and not > .5?
Also, if you are having bad results in that case, it is most likely because you did not do a proper job in data cleaning and feature extraction. For example, the title ("Mr", "Mrs",...) can give you indication on the gender, which is a super important feature to consider in your problem (I assume this is the titanic competition from kaggle).
I just needed to use a line like:
out = rf.predict(x_test)
and that would be the 0/1 answers I was looking for.