I'm trying to apply ML on atomic structures using descriptors. My problem is that I get very different score values depending on the datasize I use, I suspect that something is wrong with my model, any suggestions would be appreciated. I used dataset from this paper (Dataset MoS2(single)).
Here is the my code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import ase
from dscribe.descriptors import SOAP
from dscribe.descriptors import CoulombMatrix
from sklearn.model_selection import train_test_split
import sklearn
from sklearn.linear_model import LinearRegression
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR
from ase.io import read
materials = read('structures.xyz', index=':')
materials = materials[:5000]
energies = pd.read_csv('Energy.csv')
energies = np.array(energies['b'])
energies = energies[:5000]
species = ["H", 'Mo', 'S']
rcut = 8.0
nmax = 1
lmax = 1
# Setting up the SOAP descriptor
soap = SOAP(
species=species,
periodic=False,
rcut=rcut,
nmax=nmax,
lmax=lmax,
)
coulomb_matrices = soap.create(materials, positions=[[51]]*len(materials))
nsamples, nx, ny = coulomb_matrices.shape
d2_train_dataset = coulomb_matrices.reshape((nsamples,nx*ny))
df = pd.DataFrame(d2_train_dataset)
df['target'] = energies
from sklearn.preprocessing import StandardScaler
X = df.iloc[:, 0:12].values
y = df.iloc[:, 12:].values
st_x = StandardScaler()
st_y = StandardScaler()
X = st_x.fit_transform(X)
y = st_y.fit_transform(y)
X_train, X_test, y_train, y_test = train_test_split(X, y)
#krr = GridSearchCV(
# KernelRidge(kernel="rbf", gamma=0.1),
# param_grid={"alpha": [1e0, 0.1, 1e-2, 1e-3], "gamma": np.logspace(-2, 2, 5)},
#)
svr = GridSearchCV(
SVR(kernel="rbf", gamma=0.1),
param_grid={"C": [1e0, 1e1, 1e2, 1e3], "gamma": np.logspace(-2, 2, 5)},
)
svr = svr.fit(X_train, y_train.ravel())
print("Training set score: {:.4f}".format(svr.score(X_train, y_train)))
print("Test set score: {:.4f}".format(svr.score(X_test, y_test)))
and score:
Training set score: 0.0414
Test set score: 0.9126
I don't have a full answer to your problem as recreating it would be very cumbersome, but here are some questions to check:
a) You are training on 5 CrossValidation folds (default). First you should check the results of all parameter combinations right after the fitting process with "svr.best_score_" (or more detailed with "svr.cv_results_dict") and see what mean score your folds actually produced. If the score is really is as low as 0.04 (I assume higher is better, which these scores usually do), taking the reciprocal prediction would actually be really accurate! If you know you're always wrong, it's really easy to be right. ;D
b) You could go ahead and just use the svr.best_params_ in order to train again on the whole X_train-set instead of the folds (this can also be achieved with the "refit"-option of RandomSearchCV as well) and then check with the test set again. Here could also be the actual error: The documentation for the score method of GridSearchCV reads: "Return the score on the given data, if the estimator has been refit." This is not the case in your gridsearch! Try turning the refit option on. Maybe that works? ... sorry, your code was too cumbersome to be replicated fast, so I didn't check myself ...
I am trying to run a kNN (k-nearest neighbour) algorithm in Python.
The dataset I am using to try and do this is available at the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/wine
Here is the code I am using:
#1. LIBRARIES
import os
import pandas as pd
import numpy as np
print os.getcwd() # Prints the working directory
os.chdir('C:\\file_path') # Provide the path here
#2. VARIABLES
variables = pd.read_csv('wines.csv')
winery = variables['winery']
alcohol = variables['alcohol']
malic = variables['malic']
ash = variables['ash']
ash_alcalinity = variables['ash_alcalinity']
magnesium = variables['magnesium']
phenols = variables['phenols']
flavanoids = variables['flavanoids']
nonflavanoids = variables['nonflavanoids']
proanthocyanins = variables['proanthocyanins']
color_intensity = variables['color_intensity']
hue = variables['hue']
od280 = variables['od280']
proline = variables['proline']
#3. MAX-MIN NORMALIZATION
alcoholscaled=(alcohol-min(alcohol))/(max(alcohol)-min(alcohol))
malicscaled=(malic-min(malic))/(max(malic)-min(malic))
ashscaled=(ash-min(ash))/(max(ash)-min(ash))
ash_alcalinity_scaled=(ash_alcalinity-min(ash_alcalinity))/(max(ash_alcalinity)-min(ash_alcalinity))
magnesiumscaled=(magnesium-min(magnesium))/(max(magnesium)-min(magnesium))
phenolsscaled=(phenols-min(phenols))/(max(phenols)-min(phenols))
flavanoidsscaled=(flavanoids-min(flavanoids))/(max(flavanoids)-min(flavanoids))
nonflavanoidsscaled=(nonflavanoids-min(nonflavanoids))/(max(nonflavanoids)-min(nonflavanoids))
proanthocyaninsscaled=(proanthocyanins-min(proanthocyanins))/(max(proanthocyanins)-min(proanthocyanins))
color_intensity_scaled=(color_intensity-min(color_intensity))/(max(color_intensity)-min(color_intensity))
huescaled=(hue-min(hue))/(max(hue)-min(hue))
od280scaled=(od280-min(od280))/(max(od280)-min(od280))
prolinescaled=(proline-min(proline))/(max(proline)-min(proline))
alcoholscaled.mean()
alcoholscaled.median()
alcoholscaled.min()
alcoholscaled.max()
#4. DATA FRAME
d = {'alcoholscaled' : pd.Series([alcoholscaled]),
'malicscaled' : pd.Series([malicscaled]),
'ashscaled' : pd.Series([ashscaled]),
'ash_alcalinity_scaled' : pd.Series([ash_alcalinity_scaled]),
'magnesiumscaled' : pd.Series([magnesiumscaled]),
'phenolsscaled' : pd.Series([phenolsscaled]),
'flavanoidsscaled' : pd.Series([flavanoidsscaled]),
'nonflavanoidsscaled' : pd.Series([nonflavanoidsscaled]),
'proanthocyaninsscaled' : pd.Series([proanthocyaninsscaled]),
'color_intensity_scaled' : pd.Series([color_intensity_scaled]),
'hue_scaled' : pd.Series([huescaled]),
'od280scaled' : pd.Series([od280scaled]),
'prolinescaled' : pd.Series([prolinescaled])}
df = pd.DataFrame(d)
#5. TRAIN-TEST SPLIT
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(np.matrix(df),np.matrix(winery),test_size=0.3)
print X_train.shape, y_train.shape
print X_test.shape, y_test.shape
#6. K-NEAREST NEIGHBOUR ALGORITHM
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(X_train, y_train)
print("Test set score: {:.2f}".format(knn.score(X_test, y_test)))
In section 5, when I run sklearn.model_selection to import the train-test split mechanism, this does not appear to be running correctly because it provides the shapes: (0,13) (0,178) (1,13) (1,178).
Then, upon trying to run the knn, I get the error message: Found array with 0 sample(s) (shape=(0,13)) while a minimum of 1 is required. This is not due to scaling with max-min normalisation as I still get this error message even when the variables are not scaled.
I'm not exactly sure where your code is going wrong, it's a slightly different way of going about it compared to the sklearn docs. However, I can show you a different way of getting the train test split to work on the wine dataset for you.
from sklearn.datasets import load_wine
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
X, y = load_wine(return_X_y=True)
X_scaled = MinMaxScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y,
test_size=0.3)
knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(X_train, y_train)
I'm new to machine machine learning algorithms and classification techniques.
I have created a dataset, and trained a model with a SVM in python using sklearn module.
But now I have to change my approach from SVM to artificial immune system. My question thus is, Is there a module for AIS in python that I can use? Just like Sklearn which provides SVM.
If there is none, Where can I find an example or help on implementing one ?
Below is my code in SVM, in case anyone would need it.
# In the name of GOD
# SeyyedMahdi Hassanpour
# SeyyedMahdihp#gmail.com
# SeyyedMahdihp # github
import numpy as np
from sklearn import svm, model_selection
import pandas as pd
df = pd.read_csv('final_dataset123456.csv')
x = np.array(df.drop(['label'], 1))
y = np.array(df['label'])
x_train, x_test, y_train, y_test = model_selection.train_test_split(x, y, test_size=0.36, random_state=39)
clf = svm.SVC()
clf.fit(x_train, y_train)
accuracy = clf.score(x_test, y_test)
print(accuracy)
ar = [0,0,0,0,0,0,0,0,0,0,0,0,0.2,0.058824,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.25,0,0,0,0,0.020833,0.2,0.090909,0,0.032258,0,0,0,0,0,0.0625,0,0,0,0.058333,0,0,0.1,0,0.125,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
br = [0.5,1,1,0.254902,0.853933,1,1,0.254902,1,0.27451,0.2,1,0.4,0.176471,1,1,1,1,0.625,1,0.125,1,0.393939,0.857143,0.052632,1,0.75,0.847826,1,1,0.583333,0.7,1,1,1,0.729167,0.6,0.818182,1,0.193548,0.333333,1,0.674419,1,1,1,0.8,1,1,0.2,0.37037,1,0.8,0.529412,0.375,1,1,0.23913,1,1,1,1,0.666667,1,1,1,1,0,1,0,1,0.23913,0.7,0.7,1,1,1,1,1,1,1,1,0.23913,1,1,1,1,1,1,1,1,1,0.666667,1,0.7,1,1,1,1,0,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1]
example_measures = np.array([ar,br])
example_measures = example_measures.reshape(len(example_measures), -1)
prediction = clf.predict(example_measures)
print(prediction)
There are many artificial immune system applications implemented on WEKA Platform. You can download and use it from sourceforge.
Here is the link:
https://sourceforge.net/directory/?q=artificial+immune+system