Index of Data after train and test split - python

Guys, I'm new to data science and Python. I'm working on a regression problem. When I try to plot the test part of my target variable, I get a strange-looking plot:
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

train_input, test_input, train_target, test_target = train_test_split(
    features, target, test_size=0.25, random_state=42)

plt.xlim(0, 100)
plt.plot(test_target, 'g')
Is it because of the random indices attached to test_target?
How can I get a continuous curve instead?

If the index of the data is the problem, then reset it:
df_train = df_train.reset_index()
If you want to reset it and then set the index to another column of the df, say "A", do:
df_train = df_train.reset_index().set_index('A')
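Applied to the plotting question above, a minimal sketch (assuming test_target is a pandas Series, which it is when target is a Series) that drops the shuffled index before plotting:
import matplotlib.pyplot as plt
# reset_index(drop=True) replaces the shuffled row labels with 0..n-1,
# so matplotlib draws one continuous green curve
plt.plot(test_target.reset_index(drop=True), 'g')
plt.xlim(0, 100)
plt.show()
Alternatively, test_target.sort_index() restores the original row order if you want the curve to follow the source data.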

Related

Collate probabilities, predictions, coefficients from multiple samples of data using sklearn

I would like to combine model probabilities for class-1 predictions for ALL rows from multiple (random) splits/samples of the data into a single dataframe in Python.
I realize that not all rows will be selected in each split, but if the data sampling is repeated enough times, each row will have been selected at least a few times and model probabilities generated.
My current approach creates multiple train-test splits (5 in the example below) and collates the probabilities from each iteration into a single dataframe, as shown in the code below with a mock dataset:
import pandas as pd
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
####Instantiate logistic regression objects
log = LogisticRegression(class_weight='balanced', random_state = 1)
#### import some data
iris = datasets.load_iris()
X = pd.DataFrame(iris.data[:100, :], columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"])
y = iris.target[:100,]
# start by creating the first column of probs table
probs_table = pd.DataFrame(X.index, columns=["members"])
# iterate over random states while keeping track of `i`
for i, state in enumerate([11, 444, 21, 109, 1900]):
    train_x, test_x, train_y, test_y = train_test_split(
        X, y, stratify=y, test_size=0.2, random_state=state)
    pd.DataFrame(log.predict_proba(test_x)[:, 1]) #fit final model
    probs_table[f"iter_{i+1}"] = pd.DataFrame(log.predict_proba(test_x)[:, 1])
probs_table
Unfortunately, I am not getting probabilities for all rows in the dataframe. Can somebody please guide me to a solution? Ideally it would also include additional model outputs, such as predictions and coefficients, for each iteration/data row.
Any other way to sample the data (i.e., other than train-test splitting) is fine as well, as long as probabilities can be assembled for all dataframe rows.
There are a couple of problems with the code as is:
.fit() is never called here. I'm assuming you'd like the model fit right after the train/test split line and before the predict_proba() call?
When you place the values into the dataframe, you're creating a new column per iteration, and the rows don't line up across splits. I assume you want the predictions stacked into one table while keeping track of which iteration each came from?
Here is code that I believe accomplishes what you'd like. It 1) loops over each random-state integer, 2) creates a new train/test split, 3) fits a new model each time, and 4) predicts on each test-set row.
I also have it keep track of the original index, so you can see how many times each original row ends up in the prediction dataframe:
EDIT: Include the coefficients as a column
import pandas as pd
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
####Instantiate logistic regression objects
log = LogisticRegression(class_weight='balanced', random_state = 1)
#### import some data
iris = datasets.load_iris()
X = pd.DataFrame(iris.data[:100, :], columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"])
y = iris.target[:100,]
dfs = []
# iterate over random states while keeping track of `i`
for i, state in enumerate([11, 444, 21, 109, 1900]):
    train_x, test_x, train_y, test_y = train_test_split(
        X, y, stratify=y, test_size=0.2, random_state=state)
    log.fit(train_x, train_y)
    preds = log.predict_proba(test_x)[:, 1]
    orig_indices = test_x.index
    df = pd.DataFrame(data={
        "orig_index": orig_indices,
        "prediction": preds,
        "iteration": f"iter_{i+1}",
        "coefficients": [log.coef_[0]] * len(preds)})
    dfs.append(df)
probs_table = pd.concat(dfs)
probs_table
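To check the coverage claim (that repeated splits eventually touch every row), a quick sketch using the probs_table built above:
# How many times did each original row land in a test set?
print(probs_table["orig_index"].value_counts().describe())
# Rows never selected in any split (empty set if coverage is complete)
print(set(X.index) - set(probs_table["orig_index"]))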

Random Forest on Panel Data using Python

So I am having some trouble running a random forest regression on panel data.
The data currently looks like this: (screenshot of the DataFrame omitted)
I want to conduct a random forest regression that predicts KwH for each ID over time based on the variables I have. I have split my data into training and test samples using the following code:
from sklearn.model_selection import train_test_split
X = df[['hour', 'day', 'month', 'dayofweek', 'apparentTemperature',
'summary', 'household_size', 'work_from_home', 'num_rooms',
'int_in_renew', 'int_in_gen', 'conc_abt_cc', 'feel_abt_lifestyle',
'smrt_meter_help', 'avg_gender', 'avg_age', 'house_type', 'sum_insul',
'total_lb', 'total_fridges', 'bigg_apps', 'small_apps',
'look_at_meter']]
y = df[['KwH']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
I then wish to train my model and test it against the test sample; however, I am unsure how to do this. I have tried this code:
from sklearn.ensemble import RandomForestRegressor
rfc = RandomForestRegressor(n_estimators=200)
rfc.fit(X_train, y_train)
However, I get the following error message:
A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
I'm not sure if the error lies fundamentally in the way my data is arranged or in the way I am running the random forest, so any help with this, and with testing against the test sample afterwards, would be greatly appreciated.
Thanks in advance.
Simply switching y = df[['KwH']] to y = df['KwH'] or y = df.KwH should solve this.
This is because scikit-learn doesn't expect y to be a dataframe, and selecting columns with double brackets [[...]] returns precisely that: a one-column DataFrame rather than a Series.
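Alternatively, following the hint in the error message itself, a sketch that leaves y as a DataFrame and flattens it at fit time:
# ravel() flattens the (n_samples, 1) column vector into shape (n_samples,)
rfc.fit(X_train, y_train.values.ravel())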

How to make prediction on the new data in Pandas DataFrame with some extra columns?

I have used a random forest classifier to build a model. The model works fine; I am able to output a score as well as probability values on the train and test sets.
The challenge is:
I used 29 variables as features, with 1 target.
When I score X_test, it works fine.
When I bring in a new data set which has the 29 variables plus my unique ID / primary key, the model errors out, saying it expects 29 variables.
How do I retain my ID and get predictions for the new file?
What I tried so far:
data = pd.read_csv('learn2.csv')
y=data['Target'] # Labels
X=data[[
'xsixn', 'xssocixtesDegreeOnggy', 'xverxgeeeouseeeoggdIncome', 'BxceeeggorsDegreeOnggy', 'Bggxckorxfricxnxmericxn',
'Ceeiggdrenxteeome', 'Coggggege', 'Eggementxry', 'GrxduxteDegree', 'eeigeeSceeoogg', 'eeigeeSceeooggGrxduxte', 'eeouseeeoggdsEst',
'MedixneeouseeeoggdIncome', 'NoVeeeicgges', 'Oteeerxsixn', 'OteeersRxces', 'OwnerOccupiedPercent', 'PercentBggueCoggggxrWorkers',
'PercentWeeiteCoggggxr', 'PopuggxtionEst', 'PopuggxtionPereeouseeeoggd', 'RenterOccupiedPercent', 'RetiredOrDisxbggePersons',
'TotxggDxytimePopuggxtion', 'TotxggStudentPopuggxtion', 'Unempggoyed', 'VxcxnteeousingPercent', 'Weeite', 'WorkpggxceEstxbggiseements'
]]
# Import train_test_split function
from sklearn.model_selection import train_test_split
# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) # 80% training
#Import Random Forest Model
from sklearn.ensemble import RandomForestClassifier
#Create a Gaussian Classifier
clf=RandomForestClassifier(n_estimators=100)
#Train the model using the training set
clf.fit(X_train,y_train)
y_pred=clf.predict(X_test)
Predicting on the new file:
data1 = pd.read_csv('score.csv')
y_pred2 = clf.predict(data1)  # fails: data1 still contains the ID column
ValueError: Number of features of the model must match the input. Model n_features is 29 and input n_features is 30
You can exclude the 'ID' column while generating predictions on the new dataset by using pandas' difference function:
data1=pd.read_csv('score.csv')
For ease of further use I am storing the predictions in a new dataframe:
y_pred2 = pd.DataFrame(clf.predict(data1[data1.columns.difference(['ID'])]),columns = ['Predicted'], index = data1.index)
To map the predictions against the 'ID' use pd.concat:
pred = pd.concat([data1['ID'], y_pred2['Predicted']], axis = 1)
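Another option (a sketch, assuming the training feature DataFrame X from the question is still in scope) is to select exactly the columns the model was trained on, which also guards against reordered columns:
data1 = pd.read_csv('score.csv')
# Score only the columns used at fit time, in the same order
preds = clf.predict(data1[X.columns])
pred = data1[['ID']].assign(Predicted=preds)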

How to predict if number of features are not matching with number of features available in testset? [duplicate]

This question already has answers here:
Keep same dummy variable in training and testing data
(5 answers)
Closed 4 years ago.
I am using pandas get_dummies to convert categorical variables into dummy/indicator variables, which introduces new features into the dataset. Then we fit/train a model on this dataset.
Since X_train and X_test have the same columns, prediction works well for the test data X_test.
Now let's say we have test data in another csv file (with unknown output). When we transform this set of test data using get_dummies, the resulting dataset may not have the same number of features as the one we trained our model with. Later, when we use our model on this dataset, it fails because the number of features in the testing set does not match the model's.
Any idea how we can handle this?
Code :
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
# Load the dataset
in_file = 'train.csv'
full_data = pd.read_csv(in_file)
outcomes = full_data['Survived']
features_raw = full_data.drop('Survived', axis = 1)
features = pd.get_dummies(features_raw)
features = features.fillna(0.0)
X_train, X_test, y_train, y_test = train_test_split(features, outcomes,
                                                    test_size=0.2, random_state=42)
model = DecisionTreeClassifier(max_depth=50, min_samples_leaf=6, min_samples_split=2)
model.fit(X_train,y_train)
y_train_pred = model.predict(X_train)
#print (X_train.shape)
y_test_pred = model.predict(X_test)
from sklearn.metrics import accuracy_score
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)
print('The training accuracy is', train_accuracy)
print('The test accuracy is', test_accuracy)
# Doing it again to test another set of data
test_data = 'test.csv'
test_data1 = pd.read_csv(test_data)
test_data2 = pd.get_dummies(test_data1)
test_data3 = test_data2.fillna(0.0)
print(test_data2.shape)
print (model.predict(test_data3))
It seems a similar question has been asked before, but the most efficient/easiest way would be to follow the approach by Thibault Clement described here:
# Get the columns present in the training set but missing from the test set
missing_cols = set(X_train.columns) - set(X_test.columns)
# Add each missing column to the test set with a default value of 0
for c in missing_cols:
    X_test[c] = 0
# Ensure the columns of the test set are in the same order as in the train set
X_test = X_test[X_train.columns]
It's also worth noting that your model can only use the features it was trained on, so if there are additional columns in X_test versus X_train (rather than fewer), these will have to be removed before predicting.
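Both directions (adding missing dummy columns and dropping extra ones) can be handled in one step with pandas' reindex; a compact sketch of the same alignment:
# Missing training columns are created and filled with 0, extra columns
# are dropped, and the training column order is preserved
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)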

Merging results from model.predict() with original pandas DataFrame?

I am trying to merge the results of a predict method back with the original data in a pandas.DataFrame object.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np
data = load_iris()
# bear with me for the next few steps... I'm trying to walk you through
# how my data object landscape looks... i.e. how I get from raw data
# to matrices with the actual data I have, not the iris dataset
# put feature matrix into columnar format in dataframe
df = pd.DataFrame(data = data.data)
# add outcome variable
df['class'] = data.target
X = np.matrix(df.loc[:, [0, 1, 2, 3]])
y = np.array(df['class'])
# finally, split into train-test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
# I've got my predictions now
y_hats = model.predict(X_test)
To merge these predictions back with the original df, I try this:
df['y_hats'] = y_hats
But that raises:
ValueError: Length of values does not match length of index
I know I could split the df into train_df and test_df and this problem would be solved, but in reality I need to follow the path above to create the matrices X and y (my actual problem is a text classification problem in which I normalize the entire feature matrix before splitting into train and test). How can I align these predicted values with the appropriate rows in my df, since the y_hats array is zero-indexed and seemingly all information about which rows were included in the X_test and y_test is lost? Or will I be relegated to splitting dataframes into train-test first, and then building feature matrices? I'd like to just fill the rows included in train with np.nan values in the dataframe.
Your y_hats length will only be the length of the test data (20%) because you predicted on X_test. Once your model is validated and you're happy with the test predictions (by comparing the model's predictions on X_test against the X_test true values), you should rerun predict on the full dataset (X). Add these two lines to the bottom:
y_hats2 = model.predict(X)
df['y_hats'] = y_hats2
EDIT: Per your comment, here is an updated version that returns the dataset with the predictions appended where the rows were in the test dataset:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np
data = load_iris()
# bear with me for the next few steps... I'm trying to walk you through
# how my data object landscape looks... i.e. how I get from raw data
# to matrices with the actual data I have, not the iris dataset
# put feature matrix into columnar format in dataframe
df = pd.DataFrame(data = data.data)
# add outcome variable
df_class = pd.DataFrame(data = data.target)
# finally, split into train-test
X_train, X_test, y_train, y_test = train_test_split(df,df_class, train_size = 0.8)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
# I've got my predictions now
y_hats = model.predict(X_test)
y_test['preds'] = y_hats
df_out = pd.merge(df,y_test[['preds']],how = 'left',left_index = True, right_index = True)
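A side effect of the left join is that every row from the training split ends up with np.nan in preds, which is exactly the fill the original question asked for; a quick sanity check, as a sketch:
# Rows from the training split carry NaN after the left merge
print(df_out['preds'].isna().sum())  # should equal len(X_train)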
I have the same problem (almost), and I fixed it this way:
...
X_train, X_test, y_train, y_test = train_test_split(df,df_class, train_size = 0.8)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
y_hats = model.predict(X_test)
y_hats = pd.DataFrame(y_hats)
df_out = X_test.reset_index()
df_out["Actual"] = y_test.reset_index()["Columns_Name"]
df_out["Prediction"] = y_hats.reset_index()[0]
You can create a y_hats dataframe that copies the indices from X_test, then merge it with the original data:
y_hats_df = pd.DataFrame(data = y_hats, columns = ['y_hats'], index = X_test.index.copy())
df_out = pd.merge(df, y_hats_df, how = 'left', left_index = True, right_index = True)
Note: the left join will include the train-data rows (with NaN predictions); omitting the 'how' parameter (which defaults to an inner join) will result in just the test data.
Try this:
y_hats2 = model.predict(X)
df['y_hats'] = y_hats2
You could also make a new dataframe holding the test data along with the predicted values:
data['y_hats'] = y_hats
data.to_csv('data1.csv')
predicted = m.predict(X_valid)
predicted_df = pd.DataFrame(data=predicted, columns=['y_hat'],
index=X_valid.index.copy())
df_out = pd.merge(X_valid, predicted_df, how ='left', left_index=True,
right_index=True)
This worked well for me. It maintains the indexing positions.
# For a scikit-learn classifier, predict_proba() (not predict()) returns probabilities
pred_prob = model.predict_proba(X_test)[:, 1]  # probability of the positive class
pred_class = np.where(pred_prob > 0.5, "Yes", "No")  # for a binary (Yes/No) category
predictions = pd.DataFrame(pred_class, columns=['Prediction'])
my_new_df = pd.concat([my_old_df, predictions], axis=1)
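One caveat with pd.concat(..., axis=1): it aligns rows by index label, not by position, so if my_old_df does not carry a default RangeIndex the rows can silently misalign. A defensive variant, as a sketch:
# Reset both indexes so the concatenation is purely positional
my_new_df = pd.concat(
    [my_old_df.reset_index(drop=True), predictions.reset_index(drop=True)],
    axis=1)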
Here is a solution that worked for me:
It consists of building, for each of your folds/iterations, one dataframe which includes the observed and predicted values for your test set; this way, you make use of the index (ID) contained in y_true, which should correspond to your subjects' IDs (in my code: 'SubjID').
You then concatenate the DataFrames you generated (through 5 folds of test data in my case) and paste them back into your original dataset.
I hope this helps!
FoldNr = 0
for train_index, test_index in skf.split(X, y):
    FoldNr = FoldNr + 1
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    # [...] your model
    # performance is measured on the test set
    y_true, y_pred = y_test, clf.predict(X_test)
    # Save observed and predicted values for each test set
    a = pd.DataFrame(y_true).reset_index()
    b = pd.Series(y_pred, name='y_pred')
    globals()['ObsPred_df' + str(FoldNr)] = a.join(b)
    globals()['ObsPred_df' + str(FoldNr)].set_index('SubjID', inplace=True)

# Create a dataframe with observed and predicted values for all subjects
ObsPred_Concat = pd.concat([ObsPred_df1, ObsPred_df2, ObsPred_df3, ObsPred_df4, ObsPred_df5])
original_df['y_pred'] = ObsPred_Concat['y_pred']
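If the folds partition the whole dataset (as StratifiedKFold does), scikit-learn's cross_val_predict collapses this bookkeeping into a single call; an equivalent sketch, assuming clf, X, y, skf, and original_df as above:
from sklearn.model_selection import cross_val_predict
# Each row is predicted exactly once, by the model whose test fold contained it
original_df['y_pred'] = cross_val_predict(clf, X, y, cv=skf)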
First you need to convert the y_val or y_test data into a DataFrame:
compare_df = pd.DataFrame(y_val)
Then just create a new column with the predicted data:
compare_df['predicted_res'] = y_pred_val
After that, you can easily filter the data to show which rows match the original values, based on a simple condition:
test_df = compare_df[compare_df['y_val'] == compare_df['predicted_res']]
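From the same comparison frame you can read off the accuracy directly; a small sketch, assuming the observed column is named 'y_val' as above:
# Fraction of rows where the prediction matches the observed value
print((compare_df['y_val'] == compare_df['predicted_res']).mean())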
You can also use:
y_hats = model.predict(X)
df['y_hats'] = y_hats  # predict() returns a plain array, so it can be assigned directly
