I have irradiated some radiochromic film at different doses and scanned the films in as 48-bit RGB TIFFs to use as a calibration. My dataset comprises three columns representing the colour channels (X) and the corresponding dose (y). The input data is of the form
I have used sklearn's RandomForestRegressor to train and test successfully. When I use the following (net optical density RGB values):
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

X = df[['RedDensity', 'GreenDensity', 'BlueDensity']].values
y = df['Dose'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
regressor = RandomForestRegressor(n_estimators=20, random_state=0)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
regressor.predict(np.array([0.349, 0.296, 0.107]).reshape(1, 3))
I get a predicted value for the dose. My question is: how do I predict the dose values for a test image of shape (m x n x 3)? I could scan across the test image and read the RGB values of each element, but is there a more elegant way?
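For reference, here is a minimal sketch of a vectorized version of that per-pixel idea, assuming img is the (m, n, 3) array of net optical densities for the test film and regressor is the fitted model from above: reshape the image into an (m*n, 3) table of RGB rows, predict once, then reshape the result back into a dose map.

import numpy as np

# img is assumed to be an (m, n, 3) array of net optical densities
m, n, _ = img.shape

# Flatten the image so that every pixel becomes one row of 3 features
pixels = img.reshape(-1, 3)            # shape (m*n, 3)

# One vectorized predict call instead of looping over pixels
dose_flat = regressor.predict(pixels)  # shape (m*n,)

# Restore the original spatial layout
dose_map = dose_flat.reshape(m, n)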
What is the purpose of this line:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,random_state=1)
For supervised models such as neural networks you have input features (X) and output labels (Y). It's very important to split your data into a training dataset and a testing dataset.
To make this easy, sklearn has a function called
train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None).
Here's the documentation for sklearn.model_selection.train_test_split
Going through the function we can see that:
1.) X is your input features array
2.) Y is your output label array
3.) test_size = 0.25 states that you want your testing data to be 25% of your overall data; therefore your training data will be 75% of your overall data.
4.) random_state = 1 controls the shuffling applied to the data before the split.
5.) As for why there are 4 outputs (X_train, X_test, y_train, y_test): X is split into X_train (75%) and X_test (25%), and Y is split into y_train (75%) and y_test (25%). It's all done on one line.
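As a minimal sketch of those four outputs, using a hypothetical array of 100 examples with 3 features each:

import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 examples, 3 features each, plus 100 labels
X = np.arange(300).reshape(100, 3)
Y = np.arange(100)

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state=1)

print(X_train.shape, X_test.shape)  # (75, 3) (25, 3)
print(y_train.shape, y_test.shape)  # (75,) (25,)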
I have an issue when trying to use sklearn.preprocessing.MinMaxScaler on a large array: I want to obtain the scaling parameters so I can "redo" the normalization after handling the array for a while.
The issue is that after calling MinMaxScaler.fit_transform(data), where data is a numpy array with shape (8, 412719), the scaling parameters obtained from MinMaxScaler.scale_ are just an array of length 412719.
How do I obtain an array with all the scaling parameters instead? I'm missing 7 columns' worth of scaling parameters, if I haven't misunderstood something.
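For context, a small sketch showing that MinMaxScaler computes one scaling parameter per column; if the 8 rows are actually meant to be the features (an assumption about the intended layout), fitting on the transpose gives 8 parameters instead:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

data = np.random.rand(8, 412719)

scaler = MinMaxScaler()
scaler.fit_transform(data)
print(scaler.scale_.shape)    # (412719,) -- one parameter per column

# If the 8 rows are really the features, fit on the transposed array instead
scaler_t = MinMaxScaler()
scaler_t.fit_transform(data.T)
print(scaler_t.scale_.shape)  # (8,)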
I build my X dataframe and y target, then scale the X dataframe:
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

df3.dropna(inplace=True)
X_Columns = [column for column in df3.columns
             if column not in ["Target", "DateTime", "Date", "CO2Intensity",
                               "ActualWindProduction", "ORKWindspeed",
                               "ForecastWindProduction"]]
#print(X_Columns)
X = df3[X_Columns]
#print(X)
y = df3["Target"]
scaler = MinMaxScaler()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
X_train_scaled = scaler.fit_transform(X_train)
classifier = GaussianNB()
classifier.fit(X_train_scaled, y_train)
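As a possible continuation sketch, assuming the goal is to evaluate on the held-out data: the fitted scaler should be applied to X_test with transform (not fit_transform), so the test features are scaled using the training-set statistics.

# Scale the test features with the statistics learned from the training set
X_test_scaled = scaler.transform(X_test)

# Evaluate the fitted classifier on the scaled test features
y_pred = classifier.predict(X_test_scaled)
print(classifier.score(X_test_scaled, y_test))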
I am doing a project based on machine learning in Python and I'm trying out different models on my data. For both classification and regression, I'm really confused about whether I have to apply normalization (e.g. Z-score / standardization) to the whole dataset and then set the values of the features (X) and output (y), like this:
def normalize(df):
    from sklearn.preprocessing import MaxAbsScaler
    scaler = MaxAbsScaler()
    scaler.fit(df)
    scaled = scaler.transform(df)
    scaled_df = pd.DataFrame(scaled, columns=df.columns)
    return scaled_df

data = normalize(data)
X = data.drop('col', axis=1)
y = data['col']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Or do I only have to apply it to the features (X), like this:
X = data.drop('col', axis=1)
y = data['col']

def normalize(df):
    from sklearn.preprocessing import MaxAbsScaler
    scaler = MaxAbsScaler()
    scaler.fit(df)
    scaled = scaler.transform(df)
    scaled_df = pd.DataFrame(scaled, columns=df.columns)
    return scaled_df

X = normalize(X)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
TL;DR: apply normalization to the input data, but don't apply it to the output.
Logically, normalization is both algorithm-dependent and applied per feature.
Some algorithms do not require any normalization at all (decision trees, for example).
Applying normalization to the dataset: you should perform normalization per feature, but using all the examples in the whole dataset, whenever you have more than one feature.
For example, say you have two features, X and Y. Feature X is always a decimal in the range [0, 10], while Y is in the range [100K, 1M]. If you compare normalizing X and Y separately with normalizing them over their combined range, you will see how the values of feature X become insignificant in the combined case.
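A small numeric sketch of that effect, using made-up values for the two features and MinMaxScaler for the per-feature case:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up data: feature X in [0, 10], feature Y in [100K, 1M]
data = np.array([[2.0, 150_000.0],
                 [5.0, 600_000.0],
                 [9.0, 950_000.0]])

# Per-feature normalization: each column uses its own min and max
per_feature = MinMaxScaler().fit_transform(data)

# "Combined" normalization: a single min and max over both features together
combined = (data - data.min()) / (data.max() - data.min())

print(per_feature[:, 0])  # X keeps a meaningful spread: roughly [0, 0.43, 1]
print(combined[:, 0])     # X collapses towards 0 because Y's range dominates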
For the output (labels):
Generally, there is no need to normalize the output or labels for regression or classification tasks. But make sure to apply the normalization to the input data both at training time and at inference time.
If the task is classification, the common approach is simply to encode the classes as numbers (if you have the classes dog and cat, you assign 0 to one and 1 to the other).
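A minimal sketch of that class encoding, using sklearn's LabelEncoder (one common way to do it):

from sklearn.preprocessing import LabelEncoder

labels = ["dog", "cat", "cat", "dog"]

encoder = LabelEncoder()
y = encoder.fit_transform(labels)

print(y)                 # [1 0 0 1]
print(encoder.classes_)  # ['cat' 'dog']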
I'm trying to write a loop that tries different values of degree while training an SVM with the poly kernel in Python, using the digits dataset. I'd also like to get a test accuracy for every value of degree and to plot a graph with degree on the x axis and test accuracy on the y axis.
Here's what I have so far. It's practically nothing, but I don't know where to begin. Thanks in advance.
from sklearn.datasets import load_digits
digits = load_digits()
# Create the features matrix
X = digits.data
# Create the target vector
y = digits.target
# Create training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
from sklearn.svm import SVC
classifierObj1 = SVC(kernel='poly', degree=3)
classifierObj1.fit(X_train, y_train)
y_pred = classifierObj1.predict(X_test)
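One possible shape for that loop, sketched on top of the code above (the range of degrees tried and the plotting details are assumptions):

import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score

degrees = range(1, 6)  # assumed set of degree values to try
accuracies = []

for d in degrees:
    clf = SVC(kernel='poly', degree=d)
    clf.fit(X_train, y_train)
    accuracies.append(accuracy_score(y_test, clf.predict(X_test)))

plt.plot(list(degrees), accuracies, marker='o')
plt.xlabel('degree')
plt.ylabel('test accuracy')
plt.show()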
I am working through Google's Machine Learning videos and completed a program that utilizes a dataset storing info about flowers. The program runs successfully, but I'm having trouble understanding the results:
from scipy.spatial import distance
def euc(a, b):
    return distance.euclidean(a, b)

class ScrappyKNN():
    def fit(self, x_train, y_train):
        self.x_train = x_train
        self.y_train = y_train

    def predict(self, x_test):
        predictions = []
        for row in x_test:
            label = self.closest(row)
            predictions.append(label)
        return predictions

    def closest(self, row):
        best_dist = euc(row, self.x_train[0])
        best_index = 0
        for i in range(1, len(self.x_train)):
            dist = euc(row, self.x_train[i])
            if dist < best_dist:
                best_dist = dist
                best_index = i
        return self.y_train[best_index]
from sklearn import datasets
iris = datasets.load_iris()
x = iris.data
y = iris.target
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size =.5)
print(x_train.shape, x_test.shape)
my_classifier = ScrappyKNN()
my_classifier.fit(x_train, y_train)
prediction = my_classifier.predict(x_test)
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, prediction))
Results are as follows:
(75, 4) (75, 4)
0.96
The 96% is the accuracy, but what exactly do the 75 and 4 represent?
You are printing the shapes of the datasets on this line:
print(x_train.shape, x_test.shape)
Both x_train and x_test seem to have 75 rows (i.e. data points) and 4 columns (i.e. features) each. Unless you had an odd number of data points, the two row counts should be the same, since you are performing a 50/50 training/testing data split on this line:
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size =.5)
It appears that you are coding K Nearest Neighbours from scratch using the Euclidean metric.
From your code x_train, x_test, y_train, y_test = train_test_split(x,y, test_size =.5), what you are doing is splitting the data into 50% train and 50% test. sklearn's train_test_split splits the data by rows, so the number of features (columns) stays the same. Hence (75, 4) is the number of rows followed by the number of features, in the train set and test set respectively.
Now, the accuracy score of 0.96 basically means that, of the 75 rows in your test set, 96% (72 rows) are predicted correctly.
This compares the results from your test set against the predicted set (the predictions calculated by prediction = my_classifier.predict(x_test)).
TP and TN are the numbers of correct predictions, while TP + TN + FP + FN sums to 75 (the total number of rows you are testing).
Note: when performing the train-test split, it's usually a good idea to split the data 80/20 instead of 50/50, to give better predictions.
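As a small sketch, the same 0.96 can be reproduced by hand from the predictions, which makes the "correct out of 75" reading concrete:

import numpy as np

# Fraction of the 75 test rows predicted correctly, i.e. (TP + TN) / 75
manual_accuracy = np.mean(np.array(prediction) == y_test)
print(manual_accuracy)  # 0.96 in the run above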