I have a dataset that I have run a K-means algorithm on (scikit-learn), and I want to build a decision tree on each cluster. I can retrieve the feature values for each cluster, but not the "class" values. (I'm doing supervised learning: each element belongs to one of two classes, and I need the class value associated with each data point to build my trees.)
Ex: unfiltered data set:
[val1 val2 class]
X_train=[val1 val2]
y_train=[class]
The clustering code is this:
X = clusterDF[clusterDF.columns[clusterDF.columns.str.contains('\'AB\'')]]
y = clusterDF['Class']
(X_train, X_test, y_train, y_test) = train_test_split(X, y, test_size=0.30)
kmeans = KMeans(n_clusters=3, n_init=5, max_iter=3000, random_state=1)
kmeans.fit(X_train, y_train)
y_pred = kmeans.predict(X_test)
And this is my (unbelievably clunky!) code for extracting the values to build the tree. The issue is the Y values: they aren't consistent with the X values.
cl={i: np.where(kmeans.labels_ == i)[0] for i in range(kmeans.n_clusters)}
for j in range(0,len(k_means_labels_unique)):
    Xc=None
    Y=None
    #for i in range(0,len(k_means_labels_unique)):
    indexes = cl.get(j,0)
    for i, row in X.iterrows():
        if i in indexes:
            if Xc is not None:
                Xc = np.vstack([Xc, [row['first occurrence of \'AB\''],row['similarity to \'AB\'']]])
            else:
                Xc = np.array([row['first occurrence of \'AB\''],row['similarity to \'AB\'']])
            if Y is not None:
                Y = np.vstack([Y, y[i]])
            else:
                Y = np.array(y[i])
    Xc = pd.DataFrame(data=Xc, index=range(0, len(X)),
                      columns=['first occurrence of \'AB\'',
                               'similarity to \'AB\'']) # 1st row as the column names
    Y = pd.DataFrame(data=Y, index=range(0, len(Y)),columns=['Class'])
    print("\n\t-----Classifier ", j + 1,"----")
    (X_train, X_test, y_train, y_test) = train_test_split(X, Y,
                                                          test_size=0.30)
    classifier = DecisionTreeClassifier(criterion='entropy',max_depth = 2)
    classifier = getResults(
        X_train,
        y_train,
        X_test,
        y_test,
        classifier,
        filename='classif'+str(3 + i),
    )
Any ideas (or downright more efficient ways) for building a decision tree from the clustered data?
I did not read all of the code, but my guess is that passing an index vector into the train_test_split function would help you keep track of the samples.
X = clusterDF[clusterDF.columns[clusterDF.columns.str.contains('\'AB\'')]]
y = clusterDF['Class']
indices = clusterDF.index
X_train, X_test, y_train, y_test, indices_train, indices_test = train_test_split(X, y, indices)
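To illustrate how the index vector could then be used, here is a minimal sketch, not a tested solution for the original data. It assumes a synthetic stand-in for clusterDF with made-up values, and it relies on kmeans.labels_ being in the same order as the rows of X_train, so a boolean mask over the labels picks out the matching entries of indices_train.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for clusterDF (made-up values; column names follow the question).
rng = np.random.default_rng(0)
clusterDF = pd.DataFrame({
    "first occurrence of 'AB'": rng.normal(size=120),
    "similarity to 'AB'": rng.normal(size=120),
    "Class": rng.integers(0, 2, size=120),
})

X = clusterDF[clusterDF.columns[clusterDF.columns.str.contains("'AB'")]]
y = clusterDF["Class"]
indices = clusterDF.index
X_train, X_test, y_train, y_test, indices_train, indices_test = train_test_split(
    X, y, indices, test_size=0.30, random_state=1
)

# K-means is unsupervised, so only the features are passed to fit().
kmeans = KMeans(n_clusters=3, n_init=5, random_state=1).fit(X_train)

trees = {}
cluster_rows = {}
for k in range(kmeans.n_clusters):
    # labels_ is positionally aligned with X_train, hence with indices_train.
    in_cluster = kmeans.labels_ == k
    rows = np.asarray(indices_train)[in_cluster]
    # Look up features and class values by the same original row labels.
    Xc, yc = X.loc[rows], y.loc[rows]
    cluster_rows[k] = rows
    trees[k] = DecisionTreeClassifier(criterion="entropy", max_depth=2).fit(Xc, yc)
    print("cluster", k, "rows:", len(rows))
```

Because Xc and yc are both selected with the same row labels, the class values stay consistent with the feature values without any row-by-row loop.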
Related
I built a random forest regression with widgets for choosing the y and x variables, and I also wanted to add Shapley values by creating a graph and a table with one column for the variable and one column for its Shapley value. The code plots the graph, but the table is not showing.
So far my code looks like this:
x=widgets.SelectMultiple(
    options=list(dataset.select_dtypes('number').columns),
    disabled=False,
    value=("NUMBER_SPOTS",)
)
def randomforest(y, x):
    x = dataset[list(x)]
    y = dataset[y]
    X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
    shap.initjs()
    model = RandomForestRegressor(random_state=0).fit(X_train, y_train)
    y_predict = model.predict(X_test)
    mean_squared_error(y_test, y_predict)**(0.5)
    print('Mean Squared Error:', mean_squared_error(y_test, y_predict)**(0.5))
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X_train)
    shap.summary_plot(shap_values, features=X_train, feature_names=X_train.columns, plot_size=[15,8])
    shap_vals = shap_values[0, :]
    feature_importance = pd.DataFrame(list(zip(X_train.columns, shap_vals)), columns=['X_train', 'shap_vals'])
    feature_importance.sort_values(by=['shap_vals'], ascending=False,inplace=True)
    feature_importance
interact(randomforest, y = list(dataset.select_dtypes('number').columns), x = x)
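One likely reason the table does not appear is that a bare expression such as feature_importance on the last line of a function body is not rendered by a notebook; only the last expression of a cell is. Printing, displaying, or returning the table from the function would make it visible. The sketch below is a hedged illustration of building such a table. It assumes the SHAP values form an array of shape (n_samples, n_features), and it uses a small made-up array in place of explainer.shap_values(X_train); the helper name shap_importance_table is hypothetical. It also averages absolute values over all samples, which is a different choice from taking only the first row as in the question.

```python
import numpy as np
import pandas as pd

def shap_importance_table(shap_values, feature_names):
    """Mean absolute SHAP value per feature, largest first (hypothetical helper)."""
    mean_abs = np.abs(np.asarray(shap_values)).mean(axis=0)
    table = pd.DataFrame({"feature": list(feature_names), "mean_abs_shap": mean_abs})
    return table.sort_values("mean_abs_shap", ascending=False).reset_index(drop=True)

# Made-up stand-in for explainer.shap_values(X_train): 3 samples, 2 features.
fake_shap_values = np.array([[0.1, -0.4], [-0.2, 0.5], [0.3, -0.6]])
table = shap_importance_table(fake_shap_values, ["a", "b"])

# Inside a function, print (or display) explicitly so the table is shown.
print(table)
```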
I am training a model to predict true or false based on some data. I drop the product number from the list of features when training and testing the model.
X = df.drop(columns = ['Product Number', 'result'])  # also drop the target so it is not used as a feature
y = df['result']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
SVC = LinearSVC(max_iter = 1200)
SVC.fit(X_train, y_train)
y_pred = SVC.predict(X_test)
Is there any way for me to recover the product number and its features for the item that has passed or failed? How do I get/relate the results of y_pred to which product number it corresponds to?
I also plan on using cross validation so the data gets shuffled, would there still be a way for me to recover the product number for each test item?
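As a rough illustration of one way to do this: train_test_split keeps the DataFrame index on X_test, so the predictions can be joined back to the original rows through that index. The sketch below uses a made-up DataFrame (the feature columns f1 and f2 are hypothetical). For the cross-validation case it uses cross_val_predict, which returns one out-of-fold prediction per row in the original row order, even when the folds are shuffled.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, cross_val_predict, train_test_split
from sklearn.svm import LinearSVC

# Made-up data standing in for df; f1 and f2 are hypothetical feature columns.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "Product Number": [f"P{i:03d}" for i in range(50)],
    "f1": rng.normal(size=50),
    "f2": rng.normal(size=50),
})
df["result"] = (df["f1"] + df["f2"]) > 0

X = df.drop(columns=["Product Number", "result"])
y = df["result"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearSVC(max_iter=1200).fit(X_train, y_train)
y_pred = model.predict(X_test)

# X_test.index still holds the original row labels, so look the rows up in df.
report = df.loc[X_test.index].copy()
report["predicted"] = y_pred
print(report.head())

# With shuffled cross-validation, predictions still come back in input order.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
df["cv_predicted"] = cross_val_predict(LinearSVC(max_iter=1200), X, y, cv=cv)
```

The report table then shows the product number, its features, the true result, and the prediction side by side.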
I realised I'm using cross validation only to evaluate my model's performance so I decided to just run my code without shuffling the data to see the results for each datapoint.
Edit: For evaluation without cross validation, I drop the irrelevant columns only when I pass it to the classifier as shown below:
cols = ['id', 'label']
X = train_data.copy()
y = train_data['label']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=2)
knn = make_pipeline(StandardScaler(),KNeighborsClassifier(n_neighbors=10))
y_val_pred = knn.fit(X_train.drop(columns=cols), y_train).predict(X_val.drop(columns=cols))
X_val['y_val_pred'] = y_val_pred
I join the y_val_pred after prediction to check which datapoints have been misclassified.
Is it good practice, for a polynomial regression with dates on the x axis, to convert the datetime values to the integers 1 through len(dataframe)? Can the predicted values be considered accurate?
# Number the rows 1..len(data) in one step; this also keeps the column numeric.
data['numbered'] = range(1, len(data) + 1)
X = data[['numbered']].values
y = data[['ozone']].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
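A row counter treats every step between consecutive rows as equal, which only matches the dates if they are evenly spaced. One alternative, sketched here under the assumption that the frame has a datetime column (called date below, a hypothetical name, with made-up values), is to use the number of days elapsed since the first date, which preserves gaps in the data.

```python
import pandas as pd

# Made-up example; the "date" column name is an assumption, not from the question.
data = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-05", "2020-01-06"]),
    "ozone": [30.0, 32.0, 41.0, 39.0],
})

# Days elapsed since the first observation; the 3-day gap is kept.
data["days"] = (data["date"] - data["date"].min()).dt.days

X = data[["days"]].values
y = data["ozone"].values
print(data["days"].tolist())
```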
I've got a dataset containing a lot of missing values (NaN). I want to use linear or multiple linear regression in Python to fill in all the missing values. You can find the dataset here: Dataset
I have used f_regression(X_train, Y_train) to select which features to use.
First I convert df['country'] to dummy variables, then I keep the important features and fit a regression, but the results are not good.
I have defined the following functions to select features and predict missing values:
def select_features(target,df):
    '''Get dataset and target and print which features are important.'''
    df_dummies = pd.get_dummies(df,prefix='',prefix_sep='',drop_first=True)
    df_nonan = df_dummies.dropna()
    X = df_nonan.drop([target],axis=1)
    Y = df_nonan[target]
    X = pd.get_dummies(X)
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state=40)
    f,pval = f_regression(X_train, Y_train)
    inds = np.argsort(pval)[::1]
    results = pd.DataFrame(np.vstack((f[inds],pval[inds])), columns=X_train.columns[inds], index=['f_values','p_values']).iloc[:,:15]
    print(results)
And I have defined the following function to predict the missing values:
def train(target,features,df,deg=1):
    '''Get dataset, target and features and predict nan in target column'''
    df_dummies = pd.get_dummies(df,prefix='',prefix_sep='',drop_first=True)
    df_nonan = df_dummies[[*features,target]].dropna()
    X = df_nonan.drop([target],axis=1)
    Y = df_nonan[target]
    pol = PolynomialFeatures(degree=deg)
    X=X[features]
    X = pd.get_dummies(X)
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.40, random_state=40)
    X_test, X_val, Y_test, Y_val = train_test_split(X_test, Y_test, test_size=0.50, random_state=40)
    # X_train.shape, X_test.shape, Y_train.shape, Y_test.shape
    X_train_n = pol.fit_transform(X_train)
    reg = linear_model.Lasso()
    reg.fit(X_train_n,Y_train);
    # fit the transformer on the training data only; just transform the other sets
    X_test_n = pol.transform(X_test)
    Y_predtrain = reg.predict(X_train_n)
    print('train',r2_score(Y_train, Y_predtrain))
    Y_pred = reg.predict(X_test_n)
    print('test',r2_score(Y_test, Y_pred))
    # val
    X_val_n = pol.transform(X_val)
    X_val_n.shape,X_train_n.shape,X_test_n.shape
    Y_valpred = reg.predict(X_val_n)
    print('val',r2_score(Y_val, Y_valpred))
    X_names = X.columns.values
    X_new = df_dummies[X_names].dropna()
    X_new = X_new[df_dummies[target].isna()]
    X_new_n = pol.transform(X_new)
    Y_new = df_dummies.loc[X_new.index,target]
    Y_new = reg.predict(X_new_n)
    Y_new = pd.Series(Y_new, index=X_new.index)
    Y_new.head()
    return Y_new, X_names, X_new.index
Then I use these functions to fill the NaN values for features with p_values < 0.05.
But I am not sure whether this is a good approach.
This way, many missing values remain unpredicted.
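Rows stay unpredicted in the approach above whenever one of the selected feature columns is itself missing, because those rows are removed by dropna() before prediction. As a point of comparison only, and not the method used in the question, scikit-learn's IterativeImputer models each column with missing values as a regression on the other columns and fills every missing cell. The sketch below runs it on made-up numeric data; the column names and the 15% missing rate are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

# Made-up numeric data where column "d" depends linearly on "a" and "b".
rng = np.random.default_rng(40)
df = pd.DataFrame(rng.normal(size=(200, 4)), columns=["a", "b", "c", "d"])
df["d"] = 2 * df["a"] - df["b"] + rng.normal(scale=0.1, size=200)

# Knock out roughly 15% of the cells at random.
mask = rng.random(df.shape) < 0.15
df_missing = df.mask(mask)

# Each incomplete column is regressed on the others (BayesianRidge by default).
imputer = IterativeImputer(max_iter=10, random_state=0)
filled = pd.DataFrame(
    imputer.fit_transform(df_missing), columns=df.columns, index=df.index
)
print(filled.isna().sum())
```

Categorical columns such as country would still need to be encoded (for example with pd.get_dummies) before being passed to the imputer.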
I am working through Google's Machine Learning videos and completed a program that uses a dataset storing information about flowers. The program runs successfully, but I'm having trouble understanding the results:
from scipy.spatial import distance
def euc(a,b):
    return distance.euclidean(a, b)
class ScrappyKNN():
    def fit(self, x_train, y_train):
        self.x_train = x_train
        self.y_train = y_train
    def predict(self, x_test):
        predictions = []
        for row in x_test:
            label = self.closest(row)
            predictions.append(label)
        return predictions
    def closest(self, row):
        best_dist = euc(row, self.x_train[0])
        best_index = 0
        for i in range(1, len(self.x_train)):
            dist = euc(row, self.x_train[i])
            if dist < best_dist:
                best_dist = dist
                best_index = i
        return self.y_train[best_index]
from sklearn import datasets
iris = datasets.load_iris()
x = iris.data
y = iris.target
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size =.5)
print(x_train.shape, x_test.shape)
my_classifier = ScrappyKNN()
my_classifier.fit(x_train, y_train)
prediction = my_classifier.predict(x_test)
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, prediction))
Results are as follows:
(75, 4) (75, 4)
0.96
The 96% is the accuracy, but what exactly do the 75 and 4 represent?
You are printing the shapes of the datasets on this line:
print(x_train.shape, x_test.shape)
Both x_train and x_test seem to have 75 rows (i.e. data points) and 4 columns (i.e. features) each. Unless you had an odd number of data points, these dimensions should be the same since you are performing a 50/50 training/testing data split on this line:
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size =.5)
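As a quick check of the explanation above, this small sketch prints the shape of the full iris data next to the shapes of the two halves. The random_state value is an arbitrary choice made here for repeatability and is not in the original code.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
x, y = iris.data, iris.target

# 150 flowers (rows), each described by 4 measurements (columns).
print(x.shape)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.5, random_state=0)

# Rows are divided between the two sets; the column count is unchanged.
print(x_train.shape, x_test.shape)
```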
It appears that you are coding K Nearest Neighbours from scratch using the Euclidean distance metric.
In your code, x_train, x_test, y_train, y_test = train_test_split(x,y, test_size =.5) splits the data into 50% training and 50% test data. sklearn's train_test_split splits the data by rows, so the number of features (columns) is the same in both sets. Hence (75, 4) is the number of rows followed by the number of features, for the train set and the test set respectively.
Now, the accuracy score of 0.96 means that, of the 75 rows in your test set, 96% (72 rows) are predicted correctly.
This compares the true labels of your test set (y_test) with the predicted labels (the y_pred calculated from prediction = my_classifier.predict(x_test)): accuracy = (TP + TN) / (TP + TN + FP + FN).
TP and TN are the numbers of correct predictions, while TP + TN + FP + FN sums to 75 (the total number of rows you are testing).
Note: When performing a train-test split, it's usually a good idea to split the data 80/20 instead of 50/50, so that the model has more data to train on.
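The accuracy calculation described above can also be checked by hand: it is the fraction of positions where the predicted label equals the true label. The labels below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up true and predicted labels for 8 test rows.
y_true = np.array([0, 1, 2, 2, 1, 0, 1, 2])
y_hat = np.array([0, 1, 2, 1, 1, 0, 1, 2])

# Count the matches and divide by the number of rows tested.
n_correct = int(np.sum(y_true == y_hat))
manual_accuracy = n_correct / len(y_true)

print(n_correct, "of", len(y_true), "correct")
print(manual_accuracy, accuracy_score(y_true, y_hat))
```

In the same way, an accuracy of 0.96 on 75 test rows corresponds to 72 correct predictions.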