I'm pulling data from a CSV into a DataFrame (dfMod) and running the code below, but I'm getting an error:
# Import the necessary packages
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
# Define a normalizer
normalizer = Normalizer()
# Fit and transform
norm_movements = normalizer.fit_transform(dfMod)
# Create Kmeans model
kmeans = KMeans(n_clusters = 10,max_iter = 1000)
# Make a pipeline chaining normalizer and kmeans
pipeline = make_pipeline(normalizer,kmeans)
# Fit pipeline to daily stock movements
pipeline.fit(dfMod)
labels = pipeline.predict(dfMod)
print(len(labels), len(dfMod))
df1 = pd.DataFrame({'labels':labels,'dfMod':list(dfMod)}).sort_values(by=['labels'],axis = 0)
# now...with PCA reduction
# Define a normalizer
normalizer = Normalizer()
# Reduce the data
reduced_data = PCA(n_components = 2)
# Create Kmeans model
kmeans = KMeans(n_clusters = 10,max_iter = 1000)
# Make a pipeline chaining normalizer, pca and kmeans
pipeline = make_pipeline(normalizer,reduced_data,kmeans)
# Fit pipeline to daily stock movements
pipeline.fit(dfMod)
# Prediction
labels = pipeline.predict(dfMod)
# Create dataframe to store companies and predicted labels
df2 = pd.DataFrame({'labels':labels,'dfMod':list(dfMod.keys())}).sort_values(by=['labels'],axis = 0)
This line throws the error.
df1 = pd.DataFrame({'labels':labels,'dfMod':list(dfMod)}).sort_values(by=['labels'],axis = 0)
The weird thing is that this shows 50k and 50k:
print(len(labels), len(dfMod))
50000 50000
Am I missing something here? How can I make this work? Thanks!!
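The exact traceback isn't shown, but one likely cause (an assumption on my part) is a length mismatch inside pd.DataFrame: labels has one entry per row, while list(dfMod) yields dfMod's column names, so the two arrays have different lengths even though len(labels) and len(dfMod) both print 50000. A minimal sketch of a fix, pairing the labels with the row index instead:
# Hedged sketch: pair each row's cluster label with the row index so both
# columns passed to pd.DataFrame have the same length (50,000 entries each).
df1 = (
    pd.DataFrame({'labels': labels, 'row': dfMod.index})
      .sort_values(by='labels')
)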
I want to use principal component analysis-mutual information (PCA-MI) to obtain a data representation from a source that has source relevance (values from a smart insole) and an output variable (values from a force plate). PCA was used to determine the principal components of Ni, provided that the cumulative variance is greater than 98% of the source information measured from the 89 insole sensors. MI is generally used in the selection of input variables for predictive models because it is a good indicator of the relationship between input variables and output variables. Here I want to get results like the flowchart below.
I then tried to write the code below, but I can't reproduce what's in the flowchart:
from pandas import read_csv
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# load the dataset
def load_dataset(filename):
    # load the dataset as a pandas DataFrame
    data = read_csv(filename, header=None)
    # retrieve numpy array
    dataset = data.values
    y = dataset
    return y

def load_dataset2(filename):
    # load the dataset as a pandas DataFrame
    data2 = read_csv(filename, header=None)
    # retrieve numpy array
    dataset2 = data2.values
    X = dataset2
    return X

# feature selection
def select_features(X_train, y_train, X_test):
    # configure to select a subset of features
    fs = SelectKBest(score_func=mutual_info_classif, k=4)
    # learn relationship from training data
    fs.fit(X_train, y_train)
    # transform train input data
    X_train_fs = fs.transform(X_train)
    # transform test input data
    X_test_fs = fs.transform(X_test)
    return X_train_fs, X_test_fs, fs
# load the dataset
Insole = pd.read_csv('1119_Rwalk40s1_list.txt', header=None, low_memory=False)
SIData = np.asarray(Insole)
df = pd.read_csv('1119_Rwalk40s1.csv', low_memory=False)
columns = ['Fx','Fy','Fz','Mx','My','Mz']
selected_df = df[columns]
FCDatas = selected_df
SmartInsole = np.array(SIData)
FCData = np.array(FCDatas)
scaler_x = MinMaxScaler(feature_range=(0, 1))
scaler_x.fit(SmartInsole)
xscale = scaler_x.transform(SmartInsole)
scaler_y = MinMaxScaler(feature_range=(0, 1))
scaler_y.fit(FCData)
yscale = scaler_y.transform(FCData)
SIDataPCA = xscale
pca = PCA(n_components=89)
pca.fit(SIDataPCA)
SIdata_pca = pca.transform(SIDataPCA)
X = SIdata_pca
y = yscale
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# feature selection
X_train_fs, X_test_fs, fs = select_features(X_train, y_train, X_test)
# fit the model
model = LogisticRegression(solver='liblinear')
model.fit(X_train_fs, y_train)
# evaluate the model
yhat = model.predict(X_test_fs)
# evaluate predictions
accuracy = accuracy_score(y_test, yhat)
print('Accuracy: %.2f' % (accuracy*100))
how can I get the correct PCA-MI result data?
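Without the flowchart it's hard to be sure what the intended output is, but one issue worth flagging: the force-plate targets are continuous, so mutual_info_classif (a classification scorer) and LogisticRegression are probably not the right tools here. Below is a hedged sketch of the PCA-then-MI step using mutual_info_regression instead; the file names come from the question, and the choice of a single target column (Fz) is purely an assumption for illustration.
# Hedged sketch, not a verified solution: PCA keeps enough components for
# >= 98% cumulative variance, then mutual information ranks them against a
# continuous target. The choice of Fz as the single target is an assumption.
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.model_selection import train_test_split

insole = pd.read_csv('1119_Rwalk40s1_list.txt', header=None, low_memory=False).values
forces = pd.read_csv('1119_Rwalk40s1.csv', low_memory=False)[['Fx','Fy','Fz','Mx','My','Mz']].values

X = MinMaxScaler().fit_transform(insole)
y = MinMaxScaler().fit_transform(forces)[:, 2]   # Fz column, assumed target

pca = PCA(n_components=0.98)                     # keep >= 98% cumulative variance
X_pca = pca.fit_transform(X)
print(pca.n_components_, pca.explained_variance_ratio_.cumsum()[-1])

X_train, X_test, y_train, y_test = train_test_split(X_pca, y, test_size=0.33, random_state=1)
fs = SelectKBest(score_func=mutual_info_regression, k=4)
X_train_fs = fs.fit_transform(X_train, y_train)
X_test_fs = fs.transform(X_test)
print(fs.scores_)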
I have a battery dataframe with rows representing various cycles and a set of features for each cycle.
As an example, row 1:
df = pd.DataFrame(columns=['Ecell_V', 'I_mA', 'EnergyCharge_W_h', 'QCharge_mA_h',
'EnergyDischarge_W_h', 'QDischarge_mA_h', 'Temperature__C',
'cycleNumber', 'SOH', 'Cell'])
df.loc[0] = [3.730646, 2988.8713, 0.185061, 49.724845, 0.0, 0.0, 27.5, 2, 0.99, 'VAH11']
There are 600,000 rows.
I am trying to predict the value for SOH as follows:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression # for building a linear regression model
from sklearn.svm import SVR # for building SVR model
from sklearn.preprocessing import MinMaxScaler
train_data = pd.read_csv("train_data.csv")
train_cell = train_data.pop('Cell')
# reduce size of df train for comp purposes
train_data = train_data.iloc[::20, :]
train_data = train_data.reset_index(drop=True)
#remove unwanted features
train_data.pop('Ns')
train_data.pop('time_s')
#scale the data
scaler = MinMaxScaler()
train_data_scaled = scaler.fit_transform(train_data)
#return to df
train_data_scaled = pd.DataFrame(train_data_scaled, columns=['Ecell_V', 'I_mA', 'EnergyCharge_W_h', 'QCharge_mA_h',
'EnergyDischarge_W_h', 'QDischarge_mA_h', 'Temperature__C',
'cycleNumber', 'SOH'])
train_data_scaled
#unscale target
train_data_scaled['SOH'] = train_data['SOH']
train_data_scaled
#split target and input
X = train_data_scaled.drop('SOH', axis=1)
y = train_data_scaled['SOH'].values
#model
model = SVR(kernel='rbf', C=100, epsilon=1)
svr = model.fit(X, y)
#predict model
pred = model.predict(X)
Now returning `pred` gives the same prediction for each row:
array([0.89976814, 0.89976814, 0.89976814, ..., 0.89976814, 0.89976814,
0.89976814])
why is this happening?
Using StandardScaler() on both the X and y data corrected this issue, with inverse_transform called afterwards to return the predictions to their original scale.
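A minimal sketch of that fix, reusing the names from the question (the epsilon value is an assumption; the original epsilon=1 is wider than the whole target range, which on its own can produce a constant prediction):
# Hedged sketch of the StandardScaler fix described above; names follow the
# question, the epsilon value is an illustrative choice.
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

x_scaler = StandardScaler()
y_scaler = StandardScaler()

X_scaled = x_scaler.fit_transform(train_data.drop('SOH', axis=1))
y_scaled = y_scaler.fit_transform(train_data[['SOH']]).ravel()

model = SVR(kernel='rbf', C=100, epsilon=0.1)
model.fit(X_scaled, y_scaled)

# predictions come back in scaled units; invert to the original SOH scale
pred = y_scaler.inverse_transform(model.predict(X_scaled).reshape(-1, 1)).ravel()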
I've written an ML model that predicts diseases when symptoms are entered. I'm not quite sure how to get the final ML model output (the prediction) to display in a ReactJS frontend. Below is my model:
from cgi import test
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import mode
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.svm import SVC
# from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
import pickle
import requests
import json
# Reading the train.csv by removing the last column since it's an empty column
DATA_PATH = "D:/Final Year Project/Diabetdeck-V3/flask-server/dataset/Training.csv"
data = pd.read_csv(DATA_PATH).dropna(axis = 1)
# Checking whether the dataset is balanced or not
disease_counts = data["prognosis"].value_counts()
temp_df = pd.DataFrame({
    "Disease": disease_counts.index,
    "Counts": disease_counts.values
})
# plt.figure(figsize = (18,8))
# sns.barplot(x = "Disease", y = "Counts", data = temp_df)
# plt.xticks(rotation=90)
# plt.show()
# Encoding the target value into numerical value using LabelEncoder
encoder = LabelEncoder()
data["prognosis"] = encoder.fit_transform(data["prognosis"])
X = data.iloc[:,:-1]
y = data.iloc[:, -1]
X_training_data, X_testing_data, y_training_data, y_testing_data = train_test_split(
    X, y, test_size=0.2, random_state=24)
# Defining scoring metric for k-fold cross validation
def cv_scoring(estimator, X, y):
    return accuracy_score(y, estimator.predict(X))
# Initializing Models
models = {
    "SVC": SVC(),
    # "Gaussian NB": GaussianNB(),
    "Logistic Regression": LogisticRegression(),
    "Random Forest": RandomForestClassifier(random_state=18)
    # "Linear Regression": LinearRegression()
}
# Producing cross validation score for the models
for model_name in models:
    model = models[model_name]
    scores = cross_val_score(model, X, y, cv=10,
                             n_jobs=-1,
                             scoring=cv_scoring)
# Training and testing SVM Classifier
svm_model = SVC()
svm_model.fit(X_training_data, y_training_data)
preds = svm_model.predict(X_testing_data)
# pickle.dump(svm_model, open('model.pkl','wb'))
# Training and testing Linear Regression
# lr_model = LinearRegression()
# lr_model.fit(X_training_data, y_training_data)
# preds = lr_model.predict(X_testing_data)
lr_model = LogisticRegression(C=0.1, penalty='l2', solver='liblinear')
lr_model.fit(X_training_data, y_training_data)
lr_model.score(X_training_data, y_training_data)
preds = lr_model.predict(X_testing_data)
# pickle.dump(lr_model, open('model.pkl','wb'))
# Training and testing Random Forest Classifier
rf_model = RandomForestClassifier(random_state=18)
rf_model.fit(X_training_data, y_training_data)
preds = rf_model.predict(X_testing_data)
# pickle.dump(rf_model, open('model.pkl','wb'))
# Training the models on whole data
final_svm_model = SVC()
final_lr_model = LogisticRegression()
final_rf_model = RandomForestClassifier(random_state=18)
final_svm_model.fit(X, y)
pickle.dump(svm_model, open('model.pkl','wb'))
final_lr_model.fit(X, y)
pickle.dump(lr_model, open('model.pkl','wb'))
final_rf_model.fit(X, y)
pickle.dump(rf_model, open('model.pkl','wb'))
# Reading the test data
test_data = pd.read_csv("D:/Final Year Project/Diabetdeck-V3/flask-server/dataset/Testing.csv").dropna(axis=1)
test_X = test_data.iloc[:, :-1]
test_Y = encoder.transform(test_data.iloc[:, -1])
# Making prediction by take mode of predictions made by all the classifiers
svm_preds = final_svm_model.predict(test_X)
lr_preds = final_lr_model.predict(test_X)
rf_preds = final_rf_model.predict(test_X)
final_preds = [mode([i, j, k])[0][0]
               for i, j, k in zip(svm_preds, lr_preds, rf_preds)]
symptoms = X.columns.values
# Creating a symptom index dictionary to encode the input symptoms into numerical form
symptom_index = {}
for index, value in enumerate(symptoms):
    symptom = " ".join([i.capitalize() for i in value.split("_")])
    symptom_index[symptom] = index
data_dict = {
    "symptom_index": symptom_index,
    "predictions_classes": encoder.classes_
}
model = pickle.load(open('model.pkl', 'rb'))
# Defining the Function
# Input: string containing symptoms separated by commmas
# Output: Generated predictions by models
def predictDisease(symptoms):
    symptoms = symptoms.split(",")
    # creating input data for the models
    input_data = [0] * len(data_dict["symptom_index"])
    for symptom in symptoms:
        index = data_dict["symptom_index"][symptom]
        input_data[index] = 1
    # reshaping the input data and converting it into suitable format for model predictions
    input_data = np.array(input_data).reshape(1, -1)
    # generating individual outputs
    rf_prediction = data_dict["predictions_classes"][final_rf_model.predict(input_data)[0]]
    lr_prediction = data_dict["predictions_classes"][final_lr_model.predict(input_data)[0]]
    svm_prediction = data_dict["predictions_classes"][final_svm_model.predict(input_data)[0]]
    # making final prediction by taking mode of all predictions
    final_prediction = mode([rf_prediction, lr_prediction, svm_prediction])[0][0]
    predictions = {
        "rf_model_prediction": rf_prediction,
        "lr_model_prediction": lr_prediction,
        "svm_model_prediction": svm_prediction,
        "final_prediction": final_prediction
    }
    if final_prediction == 'Diabetes ':
        # return final_prediction
        return "You have Type 1 Diabetes"
    else:
        # print("Not Diabetes")
        # return final_prediction
        return "You do not have Diabetes"
# Testing the function
print(predictDisease("Itching,Skin Rash,Nodal Skin Eruptions"))
# print(predictDisease("Polyuria,Increased Appetite,Excessive Hunger"))
It would be great if someone could guide me on how to get the final output of the model (the prediction) to display in a ReactJS frontend. Do I have to do this using Flask, or is there another way to send it straight to the frontend?
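Flask is the usual route for serving a Python model behind a React app. Below is a minimal sketch of a JSON endpoint wrapping predictDisease; the route name, port, and payload shape are my own assumptions, not part of the original code. The React side can then call it with fetch('http://localhost:5000/predict', ...) and render the returned prediction.
# Hedged sketch of a Flask endpoint around predictDisease(); route, port and
# payload keys are assumptions for illustration only.
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/predict', methods=['POST'])
def predict():
    symptoms = request.get_json().get('symptoms', '')   # e.g. "Itching,Skin Rash"
    return jsonify({'prediction': predictDisease(symptoms)})

if __name__ == '__main__':
    app.run(port=5000)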
I'm trying to use mlxtend's SequentialFeatureSelector() in combination with a pipeline that uses ColumnTransformer(). I use the ColumnTransformer() to apply power transformations (via PowerTransformer()) only to the numeric variables, not to the binary variables. My problem is that I either get the error AttributeError: 'numpy.ndarray' object has no attribute 'columns', or the results make no sense.
If I define the features by name, numeric_features = ['feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5'], then I get the AttributeError: 'numpy.ndarray' object has no attribute 'columns'.
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PowerTransformer
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LogisticRegression
# RepeatedStratifiedKFold
rskf = RepeatedStratifiedKFold(n_splits=3, n_repeats=10,random_state=123)
cv_define = rskf.split(X, y)
cv = list(cv_define)
# Define pipeline (transformation & model)
numeric_features = ['feature_1','feature_2', 'feature_3', 'feature_4', 'feature_5']
# numeric_features = [0,1,2,3,4,]
numeric_transformer = make_pipeline(PowerTransformer(method = 'yeo-johnson', standardize=False))
preprocessor = make_pipeline(ColumnTransformer(transformers = [('num', numeric_transformer, numeric_features)], remainder = 'drop'))
pipe = make_pipeline((preprocessor), (LogisticRegression(max_iter = 10000, penalty = 'none')))
# Run SFS (Sequential Feature Selector)
sfs1 = SFS(pipe, k_features = 1, scoring = "neg_log_loss", forward = False, floating = False, cv = cv, verbose = 2, n_jobs = -1)
sfs_result = sfs1.fit(X, y)
If I define numeric_features like this: numeric_features = [0, 1, 2, 3, 4], then I get incorrect results (the same value for every iteration), which also include NaNs. See the pictures below.
Results of the SequentialFeatureSelector() - Part 1
Results of the SequentialFeatureSelector() - Part 2
The problem lies only in preprocessor = make_pipeline(ColumnTransformer(transformers = [('num', numeric_transformer, numeric_features)], remainder = 'drop')). If, for example, I remove the column-specific transformation and apply the power transformation to the whole dataset, the code works and the results are:
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PowerTransformer
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LogisticRegression
# RepeatedStratifiedKFold
rskf = RepeatedStratifiedKFold(n_splits=3, n_repeats=10,random_state=123)
cv_define = rskf.split(X, y)
cv = list(cv_define)
# Define pipeline (transformation & model)
numeric_features = ['feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5']
numeric_transformer = make_pipeline(PowerTransformer(method = 'yeo-johnson', standardize=False))
pipe = make_pipeline((numeric_transformer), (LogisticRegression(max_iter = 10000, penalty = 'none')))
# Run SFS (Sequential Feature Selector)
sfs1 = SFS(pipe, k_features = 1, scoring = "neg_log_loss", forward = False, floating = False, cv = cv, verbose = 2, n_jobs = -1)
sfs_result = sfs1.fit(X, y)
Results:
Working results of SFS - Part 1
Working result of SFS - Part 2
Does anyone have an idea how this could be solved? Any help would be much appreciated.
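I can't reproduce the NaNs without the data, but one thing worth checking (a hedged guess, not a verified diagnosis): mlxtend's SFS evaluates each candidate feature subset by slicing columns out of X before the pipeline sees them, so a ColumnTransformer keyed to fixed column names or positions can end up receiving a plain NumPy array, or columns that no longer line up with those names. A small diagnostic sketch using only the objects already defined above:
# Diagnostic sketch (assumption: X is the DataFrame used in the question).
import numpy as np

preprocessor.fit(X)                      # DataFrame with the named columns: works
print(preprocessor.transform(X).shape)

X_sub = np.asarray(X)[:, :3]             # roughly what a feature-subset slice looks like
try:
    preprocessor.fit(X_sub)              # string column names fail on a NumPy array
except Exception as e:
    print(type(e).__name__, e)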
I have a set of data on which I have used scikit-learn's PCA. I scaled the data with StandardScaler() before performing PCA.
variance_to_retain = 0.99
np_scaled = StandardScaler().fit_transform(df_data)
pca = PCA(n_components=variance_to_retain)
np_pca = pca.fit_transform(np_scaled)
# make dataframe of scaled data
# put column names on scaled data for use later
df_scaled = pd.DataFrame(np_scaled, columns=df_data.columns)
num_components = len(pca.explained_variance_ratio_)
cum_variance_explained = np.cumsum(pca.explained_variance_ratio_)
eigenvalues = pca.explained_variance_
eigenvectors = pca.components_
I then ran K-Means clustering on the scaled dataset. I can plot the cluster centers just fine in scaled space.
My question is: how do I transform the locations of the centers back into the original data space? I know that StandardScaler.fit_transform() makes the data have zero mean and unit variance. But with the new points of shape (num_clusters, num_features), can I use inverse_transform(centers) to get the centers transformed back into the range and offset of the original data?
Thanks, David
You can take cluster_centers_ from the fitted KMeans and pass them through pca.inverse_transform(), then through the scaler's inverse_transform().
Here's an example:
import numpy as np
from sklearn import decomposition
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
iris = datasets.load_iris()
X = iris.data
y = iris.target
scal = StandardScaler()
X_t = scal.fit_transform(X)
pca = decomposition.PCA(n_components=3)
pca.fit(X_t)
X_t = pca.transform(X_t)
clf = KMeans(n_clusters=3)
clf.fit(X_t)
scal.inverse_transform(pca.inverse_transform(clf.cluster_centers_))
Note that sklearn has multiple ways to do the fit/transform step. You can do StandardScaler().fit_transform(X), but you lose the scaler object, so you can't reuse it and you can't use it to invert the transformation.
Alternatively, you can do scal = StandardScaler(), followed by scal.fit(X) and then scal.transform(X).
Or you can do scal.fit_transform(X), which combines the fit and transform steps.
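A small illustrative sketch of the difference between the two patterns:
# Illustrative sketch: keeping the scaler object lets you reuse it and invert later.
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0]])

scal = StandardScaler()
X_t = scal.fit_transform(X)            # equivalent to scal.fit(X) then scal.transform(X)
print(scal.inverse_transform(X_t))     # recovers the original values

# StandardScaler().fit_transform(X) would give the same scaled array, but the fitted
# scaler is thrown away, so there is nothing left to call inverse_transform on.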
Here I am using SVR to fit the data. Before that, I use a scaling technique to scale the values, and to get the prediction in the original units I use the inverse_transform function.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
# Creating two scaler objects, one for the independent and one for the dependent variable
ss_X = StandardScaler()
ss_y = StandardScaler()
X = ss_X.fit_transform(X)
y = ss_y.fit_transform(y.reshape(-1,1))
# Creating a model object and fitting the data
reg = SVR(kernel='rbf')
reg.fit(X,y)
# To make a prediction:
# First, transform the input value to the scaled level
# Second, inverse transform the result to see the original value
ss_y.inverse_transform(reg.predict(ss_X.transform(np.array([[6.5]]))))
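One version-dependent caveat: recent scikit-learn releases expect a 2-D array in StandardScaler.inverse_transform, so the 1-D output of reg.predict may need reshaping first, e.g.:
# Reshape the 1-D prediction before inverting (needed on scikit-learn versions
# whose inverse_transform requires 2-D input).
pred_scaled = reg.predict(ss_X.transform(np.array([[6.5]])))
print(ss_y.inverse_transform(pred_scaled.reshape(-1, 1)).ravel())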