Accuracy metric of a subsection of categories in Keras

I've got a 3-class classification problem. Let's define them as classes 0,1 and 2. In my case, class 0 is not important - that is, whatever gets classified as class 0 is irrelevant. What's relevant, however, is accuracy, precision, recall, and error rate only for classes 1 and 2. I would like to define an accuracy metric that only looks at a subsection of the data that relates to 1 and 2 and gives me a measure of that as the model is training. I am not asking for code for accuracy or f1 or precision/recall - those I've found and can implement myself. What I'm asking is for code that can help select a subsection of the categories to perform these metrics on.
Visually, with a confusion matrix:
Given:
    0   1   2
0  10   3   4
1   2   5   1
2   8   5   9
I would like to perform an in-training accuracy measure on the following subset only:
    1   2
1   5   1
2   5   9
Possible idea:
Concatenate a categorized, argmaxed y_pred and argmaxed y_true, drop all instances where 0 appears, re-unravel them back into a one_hot array, and do a simple binary accuracy on what remains?
Edit:
I've tried to exclude the 0-class through this code, but it doesn't make sense. The 0-category gets effectively wrapped into the 1-category (that is, the true positives of both 0 and 1 end up being labeled as 1). Still looking for help. Can anybody help out, please?
# this solution does not work :(
def my_acc(y_true, y_pred):
    # excluding the 0-category
    y_true_cust = y_true[:, np.r_[1:3]]
    y_pred_cust = y_pred[:, np.r_[1:3]]
    # binary accuracy source code, slightly edited
    y_pred_cat = Ker.round(y_pred_cust)
    eql_cust = Ker.equal(y_true_cust, y_pred_cust)
    return Ker.mean(eql_cust, axis=-1)
# Ashwin Geet D'Sa
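# accuracy over all three classes (diagonal of the full confusion matrix)
correct_guesses_3cat = 10 + 5 + 9
print(correct_guesses_3cat)  # 24
total_guesses_3cat = 10 + 3 + 4 + 2 + 5 + 1 + 8 + 5 + 9
print(total_guesses_3cat)  # 47
accuracy_3cat = correct_guesses_3cat / total_guesses_3cat
print(f"{accuracy_3cat:.1%}")  # 51.1%
# accuracy over classes 1 and 2 only (diagonal of the 2x2 sub-matrix)
correct_guesses_2cat = 5 + 9
print(correct_guesses_2cat)  # 14
total_guesses_2cat = 5 + 1 + 5 + 9
print(total_guesses_2cat)  # 20
accuracy_2cat = correct_guesses_2cat / total_guesses_2cat
print(f"{accuracy_2cat:.1%}")  # 70.0%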
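One possible way to select the subset is to build a boolean mask that keeps only the samples where neither the true nor the predicted class is 0, and to average the matches over that mask. The sketch below does this in plain NumPy so it can be checked against the confusion matrix above; the function name subset_accuracy and its ignore_class parameter are made up for illustration, and the rows of the matrix are assumed to be the true classes. Inside a Keras metric, the same mask could presumably be built from the backend's argmax, comparison, and cast operations.

```python
import numpy as np

def subset_accuracy(y_true, y_pred, ignore_class=0):
    """Accuracy over samples where neither the true nor the predicted
    class equals ignore_class (hypothetical helper, not a Keras API)."""
    true_idx = np.argmax(y_true, axis=-1)
    pred_idx = np.argmax(y_pred, axis=-1)
    keep = (true_idx != ignore_class) & (pred_idx != ignore_class)
    if not keep.any():
        return 0.0  # nothing left to score in this batch
    return float(np.mean(true_idx[keep] == pred_idx[keep]))

# Rebuild one-hot labels from the confusion matrix in the question
# (assumption: rows are true classes, columns are predicted classes).
confusion = np.array([[10, 3, 4],
                      [2, 5, 1],
                      [8, 5, 9]])
rows, cols = np.indices(confusion.shape)
true_labels = np.repeat(rows.ravel(), confusion.ravel())
pred_labels = np.repeat(cols.ravel(), confusion.ravel())
y_true = np.eye(3)[true_labels]
y_pred = np.eye(3)[pred_labels]

full_acc = float(np.mean(true_labels == pred_labels))
sub_acc = subset_accuracy(y_true, y_pred)
print(f"{full_acc:.1%}", f"{sub_acc:.1%}")
```

Note that this drops a sample as soon as either side is class 0, which matches the 2x2 sub-matrix in the question; keeping samples whose true class is 1 or 2 regardless of the prediction would give a different number.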

Related

Multivariate Times Series Classification using Machine Learning Algorithms

I am fairly new to machine learning and am currently working on a way to classify time series data. In order to do so, I would like to get a better understanding of how time series data can be fed into machine learning algorithms.
Further information:
Each sample is a time series consisting of 2000 time points. For each time point, there are several variables, like temperature, speed, acceleration, etc. The data can be represented like this:
[Figure: data structure for one time series sample]
The whole dataset consists of 3000 samples. 3000 samples x 2000 data points per sample = 6,000,000 data points for each variable.
The goal is to classify the samples into classes from 0 to 4.
My first attempt was just feeding the data as an array into the machine learning algorithms.
Let's say, we just focus on temperature. We can now structure the data like this:
[Figure: input training data for an ML algorithm]
Let X be the training input and y be the training output; the data looks like:
[21,21,22,...]=0
[35,35,35,...]=2
[11,12,12,...]=1
[18,17,18,...]=0
Can I just feed the machine learning algorithm (like SVCs) with array-type time series data like this? How does the algorithm know that the elements in the array are chronological data and not single features?
Here is an example code of what I did so far:
dataframe.head()
   sample_nr  timestamp  temperature  speed  acceleration
0          1       0.01           21  -0.43       0.34205
1          1       0.02           21  -0.43       0.34205
2          1       0.03           22  -0.43       0.34205
Create data_list by grouping the dataframe by sample_nr, so that it holds one sub-DataFrame per sample:
data_list = []
for sample_nr, sample_df in dataframe.groupby('sample_nr'):
    # append the rows of this sample only, not the whole dataframe
    data_list.append(sample_df)
For a first step, we will only focus on one feature, let's say the temperature:
X_list = []
y_list = []
for sample in data_list:
    temp_X = np.array(sample['temperature'])
    temp_y = sample['label'].unique()[0]
    X_list.append(temp_X)
    y_list.append(temp_y)
Transform the lists to pandas DataFrames:
X_df = pd.DataFrame(X_list)
y_df = pd.DataFrame(y_list)
Now, X_df is a 3000x2000 DataFrame: each row describes a sample, and the values in the columns are the temperature values for each of the 2000 time steps:
print(X_df)
    0   1   2   3
0  21  21  22  22
1  35  35  35  36
2  11  12  12  12
Also, for the output value:
print(y_df)
   0
0  0
1  2
2  1
Now split up the data into train and test sets and fit a classifier:
from sklearn import svm
from sklearn.model_selection import train_test_split

# train_test_split already returns DataFrames when it is given DataFrames
X_train_df, X_test_df, y_train_df, y_test_df = train_test_split(X_df, y_df, test_size=0.2, shuffle=True, random_state=42)
clf = svm.SVC()
# SVC expects a 1-D target, so flatten the single-column label DataFrame
clf.fit(X_train_df, y_train_df.values.ravel())
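On the question of whether the algorithm knows the columns are chronological: with this layout it does not. An SVC with an RBF kernel only sees the squared Euclidean distances between rows, and those do not change when the time-step columns are shuffled the same way for every sample. The following is a minimal NumPy sketch of that point, using random stand-in data rather than the data from the question; pairwise_sq_dists is a helper written just for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data: 5 samples with 2000 "time steps" each (random values,
# not the temperature series from the question).
X = rng.normal(size=(5, 2000))

# Shuffle the time-step columns identically for every sample.
perm = rng.permutation(X.shape[1])
X_shuffled = X[:, perm]

def pairwise_sq_dists(A):
    """Squared Euclidean distance between every pair of rows."""
    diff = A[:, None, :] - A[None, :, :]
    return (diff ** 2).sum(axis=-1)

# The distances an RBF kernel is built from are the same either way.
order_is_ignored = np.allclose(pairwise_sq_dists(X),
                               pairwise_sq_dists(X_shuffled))
print(order_is_ignored)
```

So feeding the raw array works mechanically, but each time step is treated as an independent feature. If the temporal structure matters, it has to be encoded explicitly, for example through hand-made features (trend, lagged differences, summary statistics) or a model designed for sequences.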

Sum the predictions of a Linear Regression from Scikit-Learn

I need to make a linear regression and sum all the predictions. Maybe this isn't a question for Scikit-Learn but for NumPy because I get an array at the end and I am unable to turn it into a float.
df
   rank  Sales
0     1  18000
1     2  17780
2     3  17870
3     4  17672
4     5  17556
x = df['rank'].to_numpy()
y = df['Sales'].to_numpy()
X = x.reshape(-1,1)
regression = LinearRegression().fit(X, y)
I am getting it right up to this point. The next part (which is a while loop to sum all the values) is not working:
number_predictions = 100
x_current_prediction = 1
total_sales = 0
while x_current_prediction <= number_predictions:
    variable_sum = x_current_prediction*regression.coef_
    variable_sum_float = variable_sum.astype(np.float_)
    total_sales = total_sales + variable_sum_float
    x_current_prediction =+1
return total_sales
I think that the problem is getting regression.coef_ to be a float, but when I use astype, it does not work?

You don't need to loop like this, and you don't need to use the coefficient to compute the prediction (don't forget there may be an intercept as well).
Instead, make an array of all the values of x you want to predict for, and ask sklearn for the predictions:
X_new = np.arange(1, 101).reshape(-1, 1) # X must be 2D.
y_pred = regression.predict(X_new)
If you want to add all these numbers together, use y_pred.sum() or np.sum(y_pred), or if you want a cumulative sum, np.cumsum(y_pred) will do it.
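To see why the intercept matters for the total, here is a small sketch under one stated substitution: it uses np.polyfit instead of LinearRegression so that it runs with NumPy alone (for a single feature both are ordinary least-squares straight-line fits). It sums the 100 predictions directly and compares the result with the closed form coef * sum(x) + intercept * n.

```python
import numpy as np

# Data from the question
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([18000, 17780, 17870, 17672, 17556], dtype=float)

# Degree-1 least-squares fit; polyfit returns [slope, intercept]
slope, intercept = np.polyfit(x, y, 1)

# Predict for ranks 1..100 and add everything up
x_new = np.arange(1, 101)
y_pred = slope * x_new + intercept
total_sales = float(y_pred.sum())  # a plain Python float

# Same total without predicting point by point
closed_form = float(slope * x_new.sum() + intercept * len(x_new))
print(total_sales, closed_form)
```

The loop in the question only accumulates the coef part, so it would miss the intercept * n term even once the increment is fixed.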

Index tensor must have the same number of dimensions as self tensor

I have a dataset which looks like
  ID Target  Weight  Score Scale_Cat  Scale_num
0  A      D    65.1   87          Up          1
1  A      X    35.8   87          Up          1
2  B      C    34.7   37.5        Down       -2
3  B      P    33.4   37.5        Down       -2
4  C      B    33.1   37.5        Down       -2
5  S      X    21.4   12.5        NA          9
This dataset consists of nodes (ID) and targets (neighbors), and it has been used as a sample for testing label propagation. Classes/labels are in the column Scale_num and can take integer values from -2 to 2. The label 9 means unlabelled, and it is the label that I would like to predict using a label propagation algorithm.
Looking for examples of label propagation on Google, I found this code useful (the difference is in the label assignment, since my df already distinguishes labelled data, from -2 to 2 in steps of 1, from unlabelled data, i.e. 9): https://mybinder.org/v2/gh/thibaudmartinez/label-propagation/master?filepath=notebook.ipynb
However, when trying to use my classes instead of (-1, 0, 1) as in the original code, I got some errors. A user provided some help here: RunTimeError during one hot encoding, to fix a RuntimeError, but unfortunately still without success.
In the answer provided at that link, 40 observations and labels are randomly generated.
import random
labels = list()
for i in range(0,40):
    labels.append(list([(lambda x: x+2 if x !=9 else 5)(random.sample(classes,1)[0])]))
index_aka_labels = torch.tensor(labels)
torch.zeros(40, 6, dtype=src.dtype).scatter_(1, index_aka_labels, 1)
The error I am getting, again a RuntimeError, still seems to be caused by a wrong encoding. What I tried is the following:
import random
labels = list(df['Scale_num'])
index_aka_labels = torch.tensor(labels)
torch.zeros(len(df), 6, dtype=src.dtype).scatter_(1, index_aka_labels, 1)
getting the error
---> 7 torch.zeros(len(df), 6, dtype=src.dtype).scatter_(1, index_aka_labels, 1)
RuntimeError: Index tensor must have the same number of dimensions as self tensor
I am surely missing something (e.g., how to use classes and labels, as well as src, which is never defined in the answer provided at that link).
The two functions in the original code that are causing the error are as follows:
def _one_hot_encode(self, labels):
    # Get the number of classes
    classes = torch.unique(labels) # probably this should be replaced
    classes = classes[classes != -1] # unlabelled. In my df the unlabelled class is identified by 9
    self.n_classes = classes.size(0)
    # One-hot encode labeled data instances and zero rows corresponding to unlabeled instances
    unlabeled_mask = (labels == -1) # In my df the unlabelled class is identified by 9
    labels = labels.clone() # defensive copying
    labels[unlabeled_mask] = 0
    self.one_hot_labels = torch.zeros((self.n_nodes, self.n_classes), dtype=torch.float)
    self.one_hot_labels = self.one_hot_labels.scatter(1, labels.unsqueeze(1), 1)
    self.one_hot_labels[unlabeled_mask, 0] = 0
    self.labeled_mask = ~unlabeled_mask

def fit(self, labels, max_iter, tol):
    self._one_hot_encode(labels)
    self.predictions = self.one_hot_labels.clone()
    prev_predictions = torch.zeros((self.n_nodes, self.n_classes), dtype=torch.float)
    for i in range(max_iter):
        # Stop iterations if the system is considered at a steady state
        variation = torch.abs(self.predictions - prev_predictions).sum().item()
        prev_predictions = self.predictions
        self._propagate()
I would like to understand how to use my class/label definitions and the information in my df correctly, so that the label propagation algorithm runs without errors.

I suspect it's complaining about index_aka_labels lacking the singleton dimension. Note that in your example which works:
import random
labels = list()
for i in range(0,40):
    labels.append(list([(lambda x: x+2 if x !=9 else 5)(random.sample(classes,1)[0])]))
index_aka_labels = torch.tensor(labels)
torch.zeros(40, 6, dtype=src.dtype).scatter_(1, index_aka_labels, 1)
If you run index_aka_labels.shape, it returns (40, 1). When you just turn your pandas series into a tensor, however, you get a 1-D tensor of shape (M,), where M is the length of the series. If you simply run:
import random
labels = list(df['Scale_num'])
index_aka_labels = torch.tensor(labels)[:,None] #create another dimension
torch.zeros(len(df), 6, dtype=src.dtype).scatter_(1, index_aka_labels, 1)
the error should disappear.
One more thing: you are not converting your labels into indices as was done in the top example. To do that, you can run:
import random
labels = list(df['Scale_num'])
index_aka_labels = torch.tensor(labels)[:,None] #create another dimension
index_aka_labels = index_aka_labels + 2 # labels are [-2,-1,0,1,2] and convert them to [0,1,2,3,4]
index_aka_labels[index_aka_labels==11] = 5 #convert label 9 to index 5
torch.zeros(len(df), 6, dtype=src.dtype).scatter_(1, index_aka_labels, 1)
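The same two steps (add the singleton dimension, then shift the labels into valid column indices) can be checked without PyTorch. The sketch below is a NumPy analogue, not the original code: np.put_along_axis plays the role of scatter_ here and likewise requires the index array to have as many dimensions as the target array. The labels are the Scale_num column from the sample table in the question.

```python
import numpy as np

# Scale_num column from the sample table in the question
labels = np.array([1, 1, -2, -2, -2, 9])

# Shift labels [-2, -1, 0, 1, 2] to column indices [0, 1, 2, 3, 4];
# the unlabelled marker 9 becomes 11 and is remapped to index 5.
index_aka_labels = labels + 2
index_aka_labels[index_aka_labels == 11] = 5

# Add the singleton dimension: shape (6,) -> (6, 1)
index_2d = index_aka_labels[:, None]

# NumPy counterpart of zeros(...).scatter_(1, index, 1)
one_hot = np.zeros((len(labels), 6))
np.put_along_axis(one_hot, index_2d, 1.0, axis=1)
print(one_hot)
```

Each row ends up with a single 1 in the column of its class, with column 5 reserved for the unlabelled rows.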

Documents-terms matrix dimensionality reduction

I am working with text documents clustering, with a Hierarchical Clustering approach, in Python.
I have a corpus of 10k documents and have constructed a documents-terms matrix over a dictionary based on a collection of terms classified as 'keyword' for the entire corpus.
The matrix has a shape: [10000 x 2000] and is very sparse. (let's call it dtm)
id 0 1 2 4 ... 1998 1999
0 0 0 0 1 ... 0 0
1 0 1 0 0 ... 0 1
2 1 0 0 0 ... 1 0
.. .. ... ... .. ..
9999 0 0 0 0 ... 0 0
I think that applying some dimensionality reduction techniques could lead to an enhancement in the precision of clustering.
I have tried using some MDS approach like this
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import MDS
from sklearn.preprocessing import MinMaxScaler

def select_n_components(var_ratio, goal_var: float) -> int:
    # Set initial variance explained so far
    total_variance = 0.0
    # Set initial number of features
    n_components = 0
    # For the explained variance of each feature:
    for explained_variance in var_ratio:
        # Add the explained variance to the total
        total_variance += explained_variance
        # Add one to the number of components
        n_components += 1
        # If we reach our goal level of explained variance
        if total_variance >= goal_var:
            # End the loop
            break
    # Return the number of components
    return n_components

def do_MDS(dtm):
    # scale dtm to the range [0, 1] before estimating explained variance
    scl = MinMaxScaler(feature_range=(0, 1))
    data_rescaled = scl.fit_transform(dtm)
    tsvd = TruncatedSVD(n_components=data_rescaled.shape[1] - 1)
    tsvd.fit(data_rescaled)
    # List of explained variances
    tsvd_var_ratios = tsvd.explained_variance_ratio_
    optimal_components = select_n_components(tsvd_var_ratios, 0.95)
    # note: MDS is fitted on the original dtm, not on data_rescaled
    mds = MDS(n_components=optimal_components, dissimilarity="euclidean", random_state=1)
    pos = mds.fit_transform(dtm.values)
    U_df = pd.DataFrame(pos)
    U_df_transposed = U_df.T  # for consistency with pipeline workflow, export tdm matrix
    return U_df_transposed
The objective is to automatically detect an optimal number of components and apply the dimensionality reduction. But the output has not shown a tangible enhancement.
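The component-selection step can also be written without an explicit loop. The following is a sketch of an equivalent formulation using a cumulative sum; the name select_n_components_vectorized and the example ratios are made up for illustration and do not come from the corpus in the question.

```python
import numpy as np

def select_n_components_vectorized(var_ratio, goal_var: float) -> int:
    """Smallest number of leading components whose cumulative explained
    variance reaches goal_var; all components if it is never reached."""
    cumulative = np.cumsum(var_ratio)
    # Index of the first cumulative value >= goal_var
    first_hit = int(np.searchsorted(cumulative, goal_var, side="left"))
    # Cap at the number of available components, as the loop version does
    return min(first_hit + 1, len(cumulative))

# Hypothetical explained-variance ratios, for illustration only
example_ratios = [0.5, 0.3, 0.1, 0.06, 0.04]
n_for_95 = select_n_components_vectorized(example_ratios, 0.95)
print(n_for_95)
```

This only changes how the count is computed; whether a 95% variance target is a good choice for a very sparse matrix is a separate question.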

Regression by group in python pandas

I want to ask a quick question related to regression analysis in python pandas.
So, assume that I have the following datasets:
Group Y X
1 10 6
1 5 4
1 3 1
2 4 6
2 2 4
2 3 9
My aim is to run regression; Y is dependent and X is independent variable. The issue is I want to run this regression by Group and print the coefficients in a new data set. So, the results should be like:
Group Coefficient
1 0.25 (let's assume that the coefficient is 0.25)
2 0.30
I hope I can explain my question.
Many thanks in advance for your help.

I am not sure about the type of regression you need, but this is how you do an OLS (Ordinary least squares):
import pandas as pd
import statsmodels.api as sm

def regress(data, yvar, xvars):
    Y = data[yvar]
    # assign() returns a copy, so the group's slice of df is not modified in place
    X = data[xvars].assign(intercept=1.0)
    result = sm.OLS(Y, X).fit()
    return result.params

# This is what you need
df.groupby('Group').apply(regress, 'Y', ['X'])
You can define your regression function and pass parameters to it as mentioned.
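If only the slope of a single-variable OLS fit is needed, a lighter alternative is possible. This is a sketch that swaps statsmodels for np.polyfit (a degree-1 least-squares fit with an intercept), applied to the sample data from the question; the output column name Coefficient follows the layout the question asks for.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "Group": [1, 1, 1, 2, 2, 2],
    "Y": [10, 5, 3, 4, 2, 3],
    "X": [6, 4, 1, 6, 4, 9],
})

records = []
for group, group_df in df.groupby("Group"):
    # polyfit returns [slope, intercept] for a degree-1 fit
    slope, intercept = np.polyfit(group_df["X"], group_df["Y"], 1)
    records.append({"Group": group, "Coefficient": slope})

coefficients = pd.DataFrame(records)
print(coefficients)
```

For more than one independent variable, or when standard errors and p-values are needed, the statsmodels version is the better fit.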
