Eliminating a certain column in each FOR loop - python

I have a dataframe with 40 columns as inputs to an algorithm. I initially create a for loop which predicts the target variable in each iteration using only one of the columns (column1 in the first iteration, column2 in the second, and so on).
First I select one of the columns (named vapor) as the base of the operation.
The for loop below iterates through the columns, adds each one to vapor (constructing a dataframe with 2 columns), and predicts the target variable (HI). Based on the code below, I find the column out of all 40 that achieves the best performance.
Let's call this new best column 'wind'. What I need help with is how to add another for loop that drops 'wind' (for example) from X_new, so that the shape of X_new becomes 39 instead of 40.
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
df = pd.read_csv(r'Input_hi.csv')
df = df.fillna(0)
X = df.drop(['HI', 'Station', 'lat', 'lon'], axis=1)
y = df['HI']
# I think the for loop should be added here!
X_new = X.drop(['vapor'], axis=1)
first_model = X[['vapor']]  # the base 'vapor' column (used below but missing from the original snippet)
scaler = MinMaxScaler()
e = pd.DataFrame()
f = pd.DataFrame()
names = []
# This Model will be updated in each iteration of the loop,
# dropping a column (the best from the previous iteration)
for i in range(X_new.shape[1]):  # iterate over the remaining columns (a hard-coded range(0, 40) would overrun after drops)
    Model = X_new.iloc[:, i]
    Model = pd.DataFrame(Model)
    column_headers = list(Model)
    names.append(column_headers)
    e = pd.concat([first_model, Model], axis=1)
    scaledX = scaler.fit_transform(e)
    X_train, X_test, y_train, y_test = train_test_split(scaledX, y,
                                                        test_size=0.25,
                                                        random_state=4)
    GBC = RandomForestRegressor()
    GBC.fit(X_train, y_train)
    GBCtest = GBC.predict(X_test)
    GBCperform1 = GBC.score(X_test, y_test)
    A = abs(GBCperform1)
    metr = pd.DataFrame([A])  # a list, not a set: pandas rejects unordered set input
    f = pd.concat([f, metr], axis=0)
f.reset_index(drop=True, inplace=True)
names = pd.DataFrame(names)
data_2 = pd.concat([names , f], axis=1)
data_2.columns = ['Variable', 'R-Sq']
data_2 = data_2.sort_values(by="R-Sq",ascending=False)
data_2.reset_index(drop=True, inplace=True)
most_2 = data_2.Variable[0]
What I want in the end is a dataframe with 40 rows (one per iteration of the outer loop), containing the name of the best column found in each iteration and the corresponding metric (metr).
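For what it's worth, here is a minimal sketch of the outer loop being asked for (my reading of the intent, untested on the real data): each outer iteration scores every remaining column paired with vapor, records the best one, and drops it from the candidate pool, so the pool shrinks by one column per iteration.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

# assumes X (the features, including 'vapor') and y exist as in the question
base = X[['vapor']]                     # the fixed base column
candidates = X.drop(columns=['vapor'])  # the remaining candidate columns
scaler = MinMaxScaler()
results = []                            # (best column, R-Sq) per outer iteration

for _ in range(candidates.shape[1]):    # one outer iteration per candidate column
    scores = {}
    for col in candidates.columns:      # inner loop: score each remaining column
        pair = pd.concat([base, candidates[[col]]], axis=1)
        scaledX = scaler.fit_transform(pair)
        X_train, X_test, y_train, y_test = train_test_split(
            scaledX, y, test_size=0.25, random_state=4)
        model = RandomForestRegressor()
        model.fit(X_train, y_train)
        scores[col] = abs(model.score(X_test, y_test))
    best = max(scores, key=scores.get)
    results.append((best, scores[best]))
    candidates = candidates.drop(columns=[best])  # pool shrinks: 40 -> 39 -> 38 ...

summary = pd.DataFrame(results, columns=['Variable', 'R-Sq'])  # one row per iteration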

Related

Adding the dataframe name when looping over dataframes

I wrote the following code to loop the same function over different dataframes (named "Drought", "Flashflood", etc.). I was happy to see that it worked, but I'm trying to figure out how to record each dataframe's name alongside its train and test scores. Can someone guide me on what I'm missing here? With the code as it stands, all the names appear in every row, but I only want the corresponding one. Similarly, the output I get appends each new array together, but my understanding was that append would just add a new item to a list?
For example I'm getting this as a result:
[(0.11995478823013683, -0.07264567664161303), (0.11998113643282327, -0.034458152253100005)]
But I would expect this:
[("Drought",0.11995478823013683, -0.07264567664161303)]
[("Flashflood",0.11998113643282327, -0.034458152253100005)]
Here's the code:
import pandas as pd
from math import sqrt
from sklearn import neighbors
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
df_list = [Drought, Flashflood, Flood, Gale]
names = ['Drought', 'Flashflood', 'Flood', 'Gale']
knn_r_acc = []
rmse_val = []  # to store rmse values for different dataframes
for df in df_list:
    X = df[['Year.Month', 'IDH.M_2000', 'Population', 'IDH.M_2010']]
    y = df['Deceased'].values
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    # Scaling
    scaler = MinMaxScaler(feature_range=(0, 1))
    x_train_scaled = scaler.fit_transform(X_train)
    x_train = pd.DataFrame(x_train_scaled)
    x_test_scaled = scaler.fit_transform(X_test)
    x_test = pd.DataFrame(x_test_scaled)
    model = neighbors.KNeighborsRegressor(n_neighbors=3, weights='uniform')
    model.fit(x_train, y_train)  # fit the model
    pred = model.predict(x_test)  # make prediction on test set
    error = sqrt(mean_squared_error(y_test, pred))  # calculate rmse
    rmse_val.append(error)  # store rmse values
    # print('Model =', df, 'is:', error)
    knn.fit(X_train, y_train)  # knn is presumably another regressor defined earlier
    test_score = knn.score(X_test, y_test)
    train_score = knn.score(X_train, y_train)
    # print(test_score)
    # print(train_score)
    knn_r_acc.append((names, train_score, test_score))
print(knn_r_acc)
In your case, names is actually the whole list / array, so the full list gets appended with every tuple.
You can fix it with an index variable. Before the loop starts, add:
name_index = 0
and inside the loop, append like so:
    knn_r_acc.append((names[name_index], train_score, test_score))
    name_index += 1
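Equivalently, a common idiom (my suggestion, not part of the original answer) is to pair each dataframe with its name using zip, which removes the manual counter:
for df, name in zip(df_list, names):
    # ...fit and score exactly as in the loop above...
    knn_r_acc.append((name, train_score, test_score))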

How to preserve the unique IDs of rows when doing machine learning?

I have a dataset X that contains an ID column, some other features, and a target column. I am doing a classification task, and after doing the classification on the test set, I want to see which ID belongs to which class.
So, I do the following:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
df = pd.read_csv('Dataset.csv')
X = df.drop(['ID', 'Target_Feature'], axis=1)
Y = df[['ID', 'Target_Feature']]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33)
pol_ids = Y_test.ID ### Save the IDs of the test set to append to a new dataframe later
Y_train = Y_train.drop(['ID'], axis=1).values
Y_test = Y_test.drop(['ID'], axis=1).values
logReg = LogisticRegression()
logReg.fit(X_train, Y_train)
logReg.score(X_train, Y_train)
>>> 0.6300364252164744
predictions = logReg.predict(X_test)
predictions
>>> array([1, 0, 0, ..., 0, 1, 0], dtype=int64)
Then I do the following to construct a new dataframe with the ID column and the predictions:
y_pred = logReg.predict_proba(X_test)
df1 = pd.DataFrame(pol_ids)
df1 = df1.reset_index(drop=True)
df2 = pd.DataFrame(y_pred[:,1])
df1['Predictions']=df2
df1['Name']=df.loc[df1.index]['Name'].values ### This is one of the columns in the original dataframe
But, when I check the row in the original dataframe, df, for a given ID, its name is not the same in the new dataframe, df1. This means, most likely, that the IDs have not been correctly copied to the new dataframe.
So, how can I do that?
Check your last line:
df1['Name'] = df.loc[df1.index]['Name'].values
After reset_index, df1's index no longer matches the original rows, so change it to:
df1['Name'] = df.loc[pol_ids.index]['Name'].values
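Alternatively, a variation of my own (assuming df still carries its original index): skip reset_index entirely and let pandas align everything on the original test-row labels.
df1 = pd.DataFrame({'ID': pol_ids, 'Predictions': y_pred[:, 1]},
                   index=pol_ids.index)    # keep the test rows' original labels
df1['Name'] = df.loc[df1.index, 'Name']    # label-based alignment, no .values needed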

Preserving the index when selecting a slice of a pandas dataframe

So I am creating my training and test sets for use in a Multiple Linear Regression model using sklearn.
My dataset contains 182 features and looks like the following:
id feature1 feature2 .... feature182 Target
D24352 145 8 7 1
G09340 10 24 0 0
E40988 6 42 8 1
H42093 238 234 2 1
F32093 12 72 1 0
I then have the following code:
import pandas as pd
dataset = pd.read_csv('C:\\mylocation\\myfile.csv')
dataset0 = dataset.set_index('t1.id')
dataset2 = pd.get_dummies(dataset0)
y = dataset0.iloc[:, 31:32].values
dataset2.pop('Target')
X = dataset2.iloc[:, :180].values
Once I use dataframe.iloc, however, I lose my indexes (which I have set to be my IDs). I would like to keep these, as I currently have no way of telling which records in my results relate to which records in my original dataset when I do the following step:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
It looks like your data is stored as object type. You should convert it to float64 (assuming that all your data is numeric; otherwise convert only the columns that should be numeric). Since your index is of type string, you need to set the dtype of your dataframe after setting the index (and generating the dummies). Again, assuming that the rest of your data is of numeric type:
import numpy as np

dataset = pd.read_csv('C:\\mylocation\\myfile.csv')
dataset0 = dataset.set_index('t1.id')
dataset2 = pd.get_dummies(dataset0)
dataset0 = dataset0.astype(np.float64)  # add this line to explicitly set the dtype
Now you should be able to just leave out .values when slicing the DataFrame:
y = dataset0.iloc[:, 31:32]
dataset2.pop('Target')
X = dataset2.iloc[:, :180]
With .values you access the underlying numpy arrays of the DataFrame. These do not have an index column. Since sklearn is, in most cases, compatible with pandas, you can simply pass a pandas DataFrame to sklearn.
If this does not work, you can still apply reset_index to your DataFrame. This will add the index as a new column, which you will have to drop when passing the training data to sklearn:
dataset0.reset_index(inplace=True)
dataset2.reset_index(inplace=True)
y = dataset0.iloc[:, 31:32]  # keep these as DataFrames so .drop('index') works below
dataset2.pop('Target')
X = dataset2.iloc[:, :180]
from sklearn.model_selection import train_test_split  # sklearn.cross_validation has been removed
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train.drop('index', axis=1), y_train.drop('index', axis=1))
y_pred = regressor.predict(X_test.drop('index', axis=1))
In this case you'll still have to change the slicing [:, 31:32] and [:, :180] to the correct columns, so that the index will be included in the slice.
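As a further note (my own suggestion, assuming a reasonably recent scikit-learn and that 'Target' is numeric so get_dummies leaves it intact): estimators accept DataFrames directly, so you can skip both .values and reset_index, and the 't1.id' index rides along with every split.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

dataset = pd.read_csv('C:\\mylocation\\myfile.csv')
dataset2 = pd.get_dummies(dataset.set_index('t1.id'))
y = dataset2.pop('Target')   # Series indexed by 't1.id'
X = dataset2                 # dummies, same index

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
regressor = LinearRegression().fit(X_train, y_train)

# predictions re-attached to the original ids via the index of X_test
y_pred = pd.Series(regressor.predict(X_test), index=X_test.index, name='y_pred')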

Scikit Learn Tree-based feature selection keeping the columns name?

I want to do tree-based feature selection.
My dataset has about 30 columns and, after the selection, about 5 remain.
That's great for me; the problem is that the 5-column dataset I get does not keep the column names, so I cannot identify them.
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
data = pd.read_csv(file)
X = data.drop('target', axis=1)
y = data['target']
X.shape #(100000, 30)
clf = ExtraTreesClassifier()
clf = clf.fit(X, y)
clf.feature_importances_
model = SelectFromModel(clf, prefit=True)
X_new = model.transform(X)
X_new.shape #(100000, 5)
Can someone help me please?
Now that I'm more confident of the answer, try the following:
mask = model.get_support(indices=False) # this will return boolean mask for the columns
X_new = X.loc[:, mask] # the sliced dataframe, keeping selected columns
featured_col_names = X_new.columns # columns name index
If all you need is just the column names:
X.columns[model.get_support()]
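If you also want the reduced data back as a labeled DataFrame, a small sketch building on the answer above:
selected = X.columns[model.get_support()]             # names of the kept features
X_new = pd.DataFrame(model.transform(X),
                     columns=selected, index=X.index)  # shape (100000, 5), with names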

Merging results from model.predict() with original pandas DataFrame?

I am trying to merge the results of a predict method back with the original data in a pandas.DataFrame object.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np
data = load_iris()
# bear with me for the next few steps... I'm trying to walk you through
# how my data object landscape looks... i.e. how I get from raw data
# to matrices with the actual data I have, not the iris dataset
# put feature matrix into columnar format in dataframe
df = pd.DataFrame(data = data.data)
# add outcome variable
df['class'] = data.target
X = np.matrix(df.loc[:, [0, 1, 2, 3]])
y = np.array(df['class'])
# finally, split into train-test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
# I've got my predictions now
y_hats = model.predict(X_test)
To merge these predictions back with the original df, I try this:
df['y_hats'] = y_hats
But that raises:
ValueError: Length of values does not match length of index
I know I could split the df into train_df and test_df and this problem would be solved, but in reality I need to follow the path above to create the matrices X and y (my actual problem is a text classification problem in which I normalize the entire feature matrix before splitting into train and test). How can I align these predicted values with the appropriate rows in my df, since the y_hats array is zero-indexed and seemingly all information about which rows were included in the X_test and y_test is lost? Or will I be relegated to splitting dataframes into train-test first, and then building feature matrices? I'd like to just fill the rows included in train with np.nan values in the dataframe.
Your y_hats length will only be the length of the test data (20%), because you predicted on X_test. Once your model is validated and you're happy with the test predictions (by comparing the model's accuracy on X_test against the X_test true values), you should rerun the prediction on the full dataset (X). Add these two lines to the bottom:
y_hats2 = model.predict(X)
df['y_hats'] = y_hats2
EDIT: per your comment, here is an updated version that returns the dataset with the predictions appended where they were in the test dataset:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np
data = load_iris()
# bear with me for the next few steps... I'm trying to walk you through
# how my data object landscape looks... i.e. how I get from raw data
# to matrices with the actual data I have, not the iris dataset
# put feature matrix into columnar format in dataframe
df = pd.DataFrame(data = data.data)
# add outcome variable
df_class = pd.DataFrame(data = data.target)
# finally, split into train-test
X_train, X_test, y_train, y_test = train_test_split(df,df_class, train_size = 0.8)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
# I've got my predictions now
y_hats = model.predict(X_test)
y_test['preds'] = y_hats
df_out = pd.merge(df,y_test[['preds']],how = 'left',left_index = True, right_index = True)
I had (almost) the same problem, and I fixed it this way:
...
X_train, X_test, y_train, y_test = train_test_split(df,df_class, train_size = 0.8)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
y_hats = model.predict(X_test)
y_hats = pd.DataFrame(y_hats)
df_out = X_test.reset_index()
df_out["Actual"] = y_test.reset_index()["Columns_Name"]
df_out["Prediction"] = y_hats.reset_index()[0]
You can create a y_hat dataframe copying indices from X_test then merge with the original data.
y_hats_df = pd.DataFrame(data = y_hats, columns = ['y_hats'], index = X_test.index.copy())
df_out = pd.merge(df, y_hats_df, how = 'left', left_index = True, right_index = True)
Note that a left join will include the train data rows; omitting the how parameter (which defaults to 'inner') will keep just the test data.
Try this:
y_hats2 = model.predict(X)
df['y_hats'] = y_hats2
You can probably make a new dataframe from the test data and add the predicted values to it:
data['y_hats'] = y_hats  # data here being the test-set dataframe
data.to_csv('data1.csv')
predicted = m.predict(X_valid)
predicted_df = pd.DataFrame(data=predicted, columns=['y_hat'],
                            index=X_valid.index.copy())
df_out = pd.merge(X_valid, predicted_df, how='left', left_index=True,
                  right_index=True)
This worked well for me. It maintains the indexing positions.
pred_prob = model.predict(X_test)  # predicted scores (assumes the model outputs probabilities)
pred_class = np.where(pred_prob > 0.5, "Yes", "No")  # for a binary (Yes/No) category
predictions = pd.DataFrame(pred_class, columns=['Prediction'])
my_new_df = pd.concat([my_old_df, predictions], axis=1)  # assumes my_old_df has a default RangeIndex
Here is a solution that worked for me:
It consists of building, for each of your folds/iterations, one dataframe which includes observed and predicted values for your test set; this way, you make use of the index (ID) contained in y_true, which should correspond to your subjects' IDs (in my code: 'SubjID').
You then concatenate the DataFrames that you generated (through 5 folds of test data in my case) and paste them back into your original dataset.
I hope this helps!
FoldNr = 0
for train_index, test_index in skf.split(X, y):  # skf: e.g. a StratifiedKFold instance
    FoldNr = FoldNr + 1
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    # [...] your model
    # performance is measured on the test set
    y_true, y_pred = y_test, clf.predict(X_test)
    # save observed and predicted values for each test fold
    a = pd.DataFrame(y_true).reset_index()
    b = pd.Series(y_pred, name='y_pred')
    globals()['ObsPred_df' + str(FoldNr)] = a.join(b)
    globals()['ObsPred_df' + str(FoldNr)].set_index('SubjID', inplace=True)

# create a dataframe with observed and predicted values for all subjects
ObsPred_Concat = pd.concat([ObsPred_df1, ObsPred_df2, ObsPred_df3, ObsPred_df4, ObsPred_df5])
original_df['y_pred'] = ObsPred_Concat['y_pred']
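As a side note (my addition, assuming skf is one of scikit-learn's CV splitters and that original_df's rows line up with X): cross_val_predict collects the same out-of-fold predictions in one call, already in the original row order.
from sklearn.model_selection import cross_val_predict

# one out-of-fold prediction per row of X, ordered like X
original_df['y_pred'] = cross_val_predict(clf, X, y, cv=skf)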
First you need to convert your y_val or y_test data into a DataFrame:
compare_df = pd.DataFrame(y_val)
Then just create a new column with the predicted data:
compare_df['predicted_res'] = y_pred_val
After that, you can easily filter for the rows where the prediction matches the original value with a simple condition:
test_df = compare_df[compare_df['y_val'] == compare_df['predicted_res']]
You can also use:
y_hats = model.predict(X)
df['y_hats'] = y_hats.reset_index()['name of the target column']
