I need to classify text into emotion labels. I'm using multi-label classification because the same text can contain more than one emotion, but I want some of the labels to be mutually exclusive (disjoint), such as happy/sad or calm/angry.
Let's imagine that I have this code in Python:
from simpletransformers.classification import (
MultiLabelClassificationModel, MultiLabelClassificationArgs
)
model_args = MultiLabelClassificationArgs(num_train_epochs=1)
# Create a MultiLabelClassificationModel
model = MultiLabelClassificationModel(
    "roberta", "roberta-base", num_labels=4, args=model_args,
)
with this sample:
train_data = [
["AAA", [1, 0, 0, 1]],
["BBB", [0, 1, 1, 1]],
["CCC", [1, 0, 1, 1]],
]
and I want the first and second labels to be disjoint. How could I do that?
I suggest putting the effort into post-processing (i.e. after obtaining the prediction logits).
Assuming you have already selected a threshold for each label (for example from the ROC curve), one possible approach is:
Separate labels into disjoint groups (e.g. happy/sad, calm/angry)
Within each group, determine the priority of each label (e.g. Happy > Sad)
Obtain the final labels by first deciding whether the text is Happy, and then deciding whether it is Sad only when it is not Happy
Fine-tune the thresholds calculated earlier so that the metrics of all labels satisfy your needs
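The steps above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `apply_disjoint_groups`, the example probabilities, and the 0.5 thresholds are made up here, and the probabilities are assumed to be the per-label sigmoid outputs of the model.

```python
import numpy as np

def apply_disjoint_groups(probs, thresholds, groups):
    """Threshold per-label probabilities, then enforce disjoint groups.

    groups is a list of lists of label indices, each in priority order
    (highest priority first). Hypothetical helper, not part of any library.
    """
    probs = np.asarray(probs, dtype=float)
    preds = (probs >= np.asarray(thresholds)).astype(int)
    for group in groups:
        for rank, idx in enumerate(group):
            higher = group[:rank]
            if higher:
                # Drop this label wherever a higher-priority label is already set
                blocked = preds[:, higher].any(axis=1)
                preds[blocked, idx] = 0
    return preds

# Made-up probabilities for two texts and four labels
probs = [[0.9, 0.8, 0.2, 0.7],
         [0.3, 0.6, 0.9, 0.1]]
thresholds = [0.5, 0.5, 0.5, 0.5]
# Labels 0 and 1 are disjoint, and label 0 has priority over label 1
final = apply_disjoint_groups(probs, thresholds, groups=[[0, 1]])
print(final)
```

In the first row both label 0 and label 1 pass their thresholds, so the lower-priority label 1 is dropped; the second row is left unchanged.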
In conclusion, be flexible and focus on your "business" objectives (or higher level objectives) after obtaining the logits from ML model! :)
I have a dataset with a few categories that are coded as the presence or absence of an intervention. I am fitting a simple linear regression using OLS. Multiple interventions may be present at the same time, so I am adding an effect for each intervention. However, the dummy variable is not being encoded the way I want, which makes the effects hard to interpret.
data = {
"cat1": [0,0,0,1,1,1,1],
"cat2": [0,0,0,0,0,1,1],
"cat3": [1,1,0,0,1,1,1],
"cat4": [0,1,0,1,1,1,1],
}
import pandas as pd

# load data into a DataFrame object:
dftest = pd.DataFrame(data)
# variable to store the label mapping into
label_map = {}
for fact in dftest.columns:
    dftest[fact], label_map[fact] = pd.factorize(dftest[fact])
    print(label_map[fact])
Produces the following dummy coding output...
Int64Index([0, 1], dtype='int64')
Int64Index([0, 1], dtype='int64')
Int64Index([1, 0], dtype='int64')
Int64Index([0, 1], dtype='int64')
Question 1)
How do I ensure that the 0 in the original mapping is always the dummy?
Question 2)
If I code an interaction effect in statsmodels, will that dummy carry over to the interaction? I have another categorical column with many levels, which in the model has an interaction with each of the other effects. I don't care which level is the dummy in that category, but I do care that its interaction effects use the 0 from the other categories as their dummy, i.e. their reference.
Question 3)
If it's not possible to fix, can I just reverse the effect sizes of the resulting model where appropriate? So if 1 is the dummy for that effect, do I just multiply the effect sizes by -1?
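For Question 1, one option that may help is the `sort=True` argument of `pd.factorize`. By default the codes follow order of first appearance, which is why `cat3` (whose first value is 1) comes out as `[1, 0]`. The sketch below reuses the data from the question and is only a minimal illustration, not a statement about how the model should be specified.

```python
import pandas as pd

data = {
    "cat1": [0, 0, 0, 1, 1, 1, 1],
    "cat2": [0, 0, 0, 0, 0, 1, 1],
    "cat3": [1, 1, 0, 0, 1, 1, 1],
    "cat4": [0, 1, 0, 1, 1, 1, 1],
}
dftest = pd.DataFrame(data)

label_map = {}
for fact in dftest.columns:
    # sort=True orders the unique values first, so 0 always receives code 0
    dftest[fact], label_map[fact] = pd.factorize(dftest[fact], sort=True)
    print(list(label_map[fact]))
```

When the model is written as a statsmodels formula, the reference level can also be named explicitly with patsy's `C(cat3, Treatment(reference=0))`.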
For a classification task, I am using sklearn's VotingClassifier to ensemble a Random Forest and an Extra-Trees classifier with the parameter voting='hard'. I don't understand how this works, since both tree-based models already produce their final prediction by voting. How can they work in combination using hard voting? And what happens if there is a tie between the two models?
Can anyone explain this with example?
You can look that up in the source code of the voting classifier.
In short, it doesn't make much sense to combine only two classifiers with hard voting. Use soft voting instead.
The reason is that in hard-voting mode the sklearn VotingClassifier returns the majority vote, and with two classifiers that only gets interesting when there is a tie. If there are as many zeros as ones in a binary classification, the hard-voting classifier will predict 0.
You can test this by running the code that the classifier executes internally:
First set up the data for the experiment:
import numpy as np
# create a random int array with values 0 and 1
# with 20 rows (20 predictions) of 10 voters (10 columns)
a = np.random.randint(0, 2, size=(20,10))
# then produce some tie-lines with different combinations
a[0,:] = [0]*5 + [1]*5 # the first 5 predict 0 the next 5 predict 1
a[1,:] = [1]*5 + [0]*5 # vice versa
a[2,:] = [0,1]*5 # the first predicts 0 then 1 follows 0 and 0 follows 1
a[3,:] = [1,0]*5 # the same, just starting with 1
# if you want to check, how many ones you have, do:
a.sum(axis=1)
Now see what the voter code does with this. The voter code for hard voting is shown below (it simulates the case where you have equally weighted classifiers, weights=[1]*10):
np.apply_along_axis(
lambda x: np.argmax(
np.bincount(x, weights=[1.0]*10)),
axis=1, arr=a)
The result is:
array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
You can see that for the first four entries (the ties we introduced manually), the result is 0. So in case of a tie, the voting classifier will always choose 0 (if you choose the same weight for each of the classifiers). Note that the weights are not only used to resolve ties: you could have a classifier with double the weight of another classifier, so you might even get a tie with 3 classifiers this way. But whenever the sum of the prediction weights of all 0-predicting classifiers equals the sum of the prediction weights of all 1-predicting classifiers, the voting classifier will predict 0 rather than 1.
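To make the weighted tie concrete, here is a small sketch using the same `np.bincount` / `np.argmax` combination. The predictions, weights, and probabilities are made up for illustration. The second half contrasts this with soft voting, which averages the class probabilities instead of counting votes.

```python
import numpy as np

# Hard voting: three classifiers, the first one has double weight
predictions = np.array([0, 1, 1])
weights = [2.0, 1.0, 1.0]
votes = np.bincount(predictions, weights=weights)  # weighted votes per class
hard_winner = int(np.argmax(votes))  # argmax returns the first maximum
print(votes, hard_winner)

# Soft voting: two classifiers that disagree on the predicted label
proba_a = np.array([0.55, 0.45])  # classifier A leans slightly towards class 0
proba_b = np.array([0.20, 0.80])  # classifier B is confident in class 1
mean_proba = np.mean([proba_a, proba_b], axis=0)
soft_winner = int(np.argmax(mean_proba))
print(mean_proba, soft_winner)
```

With hard voting the two classifiers in the second half would tie at one vote each and the result would be 0. With soft voting the more confident classifier decides the outcome.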
Here is the relevant code:
Sklearn Voting code and Description of numpy.argmax
I'm trying to replicate an experiment from a paper using SVM, to improve my knowledge of machine learning. In this paper, the author extracts the features and chooses the feature sizes. He then shows a table where F represents the size of the feature vector and N represents the number of face images.
He then works with F >= 9 and N >= 15 parameters.
Now, what I want to do is extract the features the same way he does in the paper.
Basically, this is how I extract the features:
import os

import numpy as np
import skimage.io
from skimage.transform import resize
from sklearn.utils import Bunch

# CATEGORIES (list of class folder names) and DATADIR (dataset root folder)
# are assumed to be defined elsewhere as module-level globals.
def load_image_files(fullpath, dimension=(64, 64)):
    descr = "A image classification dataset"
    images = []
    flat_data = []
    target = []
    dimension = (64, 64)
    for category in CATEGORIES:
        path = os.path.join(DATADIR, category)
        for person in os.listdir(path):
            personfolder = os.path.join(path, person)
            for imgname in os.listdir(personfolder):
                class_num = CATEGORIES.index(category)
                fullpath = os.path.join(personfolder, imgname)
                # resize each image and store it as one flat pixel vector
                img_resized = resize(skimage.io.imread(fullpath), dimension, anti_aliasing=True, mode='reflect')
                flat_data.append(img_resized.flatten())
                images.append(skimage.io.imread(fullpath))
                target.append(class_num)
    flat_data = np.array(flat_data)
    target = np.array(target)
    images = np.array(images)
    print(CATEGORIES)
    return Bunch(data=flat_data,
                 target=target,
                 target_names=category,
                 images=images,
                 DESCR=descr)
How do I select the number of features extracted and stored? Or how do I manually store a vector with the number of features that I need, for instance a feature vector of size 9?
I'm trying to separate my features this way:
X_train, X_test, y_train, y_test = train_test_split(
image_dataset.data, image_dataset.target, test_size=0.3,random_state=109)
model = ExtraTreesClassifier(n_estimators=10)
model.fit(X_train, y_train)
print(model.feature_importances_)
Though, my output is:
[0. 0. 0. ... 0. 0. 0.]
For SVM classification, I'm trying to use OneVsRestClassifier:
model_to_set = OneVsRestClassifier(SVC(kernel="poly"))
parameters = {
"estimator__C": [1,2,4,8],
"estimator__kernel": ["poly", "rbf"],
"estimator__degree":[1, 2, 3, 4],
}
model_tunning = GridSearchCV(model_to_set, param_grid=parameters)
model_tunning
model_tunning.fit(X_train, y_train)
prediction = model_tunning.best_estimator_.predict(X_test)
Then, once I inspect prediction, I get:
Out[29]:
array([1, 0, 4, 2, 1, 3, 3, 0, 1, 1, 3, 4, 1, 1, 0, 3, 2, 2, 2, 0, 4, 2,
2, 4])
So you've got two arrays of image information (one unprocessed, the other resized and flattened) as well as a list of corresponding class values (which we usually call labels). There are currently 2 things not quite right with the setup, however:
1) What's missing here are multiple features - these might include specific arrays from data associated with feature extraction from morphological/computer vision processes of your images, or they may be ancillary data like a list of preferences, behaviors, purchases. Basically, anything that can act as an array in either a numerical or categorical format. Technically speaking, your resized images are a second feature, but I don't think this will add much if any improvement in model performance.
2) target_names=category in your function's return statement will store only the last value that category took while iterating over CATEGORIES. I don't know if this is what you want.
Going back to your table, N would refer to the number of images in the dataset, and F would be the number of corresponding feature arrays associated with that image. By way of example, let's say we have fifty individual wines and five features (colour, taste, alcohol content, pH, optical density). N of 5 would be five of those wines, and F of 2 would be, say, colour, taste.
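In array terms, the wine example corresponds to slicing a data matrix with one row per sample and one column per feature. The sketch below uses random placeholder values and hypothetical variable names purely to show the shapes involved.

```python
import numpy as np

feature_names = ["colour", "taste", "alcohol content", "pH", "optical density"]

# Placeholder data: 50 wines (rows) described by 5 features (columns)
rng = np.random.default_rng(0)
wines = rng.random((50, len(feature_names)))

N, F = 5, 2
subset = wines[:N, :F]            # first N wines, first F features
used_features = feature_names[:F]

print(subset.shape, used_features)
```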
If I had to guess at what your features would be, they would in fact be a single feature - the image data itself. Looking at your data structure, every label/category you have will have multiple individuals (people) each with multiple examples of images of that person. Note that multiple individuals are not separate features - the way you're structuring the data, the individuals are grouped together under a single category.
So, where to from here? Without knowing what paper you're reading it's hard to suggest what to do, but I would go back and see if you can perhaps provide us with more information about the problem.
Non-Censored (Complete) Dataset
I am attempting to use the scipy.stats.weibull_min.fit() function to fit some life data. Example generated data is contained below within values.
values = np.array(
[10197.8, 3349.0, 15318.6, 142.6, 20683.2,
6976.5, 2590.7, 11351.7, 10177.0, 3738.4]
)
I attempt to fit using the function:
fit = scipy.stats.weibull_min.fit(values, loc=0)
The result:
(1.3392877335100251, -277.75467055900197, 9443.6312323849124)
Which isn't far from the nominal beta and eta values of 1.4 and 10000.
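A side note on the call above: in `scipy.stats` fitting, `loc=0` only supplies a starting guess for the location, which is why the fitted location comes out as -277.75. To hold the location at zero and fit a two-parameter Weibull, the keyword is `floc=0`. A minimal sketch with the same data:

```python
import numpy as np
from scipy import stats

values = np.array(
    [10197.8, 3349.0, 15318.6, 142.6, 20683.2,
     6976.5, 2590.7, 11351.7, 10177.0, 3738.4]
)

# floc=0 fixes the location parameter instead of merely initialising it
shape, loc, scale = stats.weibull_min.fit(values, floc=0)
print(shape, loc, scale)
```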
Right-Censored Data
The Weibull distribution is well known for its ability to deal with right-censored data. This makes it incredibly useful for reliability analysis. How do I deal with right-censored data within scipy.stats? That is, how do I fit a curve to data that includes items that have not failed yet?
The input form might look like:
values = np.array(
[10197.8, 3349.0, 15318.6, 142.6, np.inf,
6976.5, 2590.7, 11351.7, 10177.0, 3738.4]
)
or perhaps using np.nan or simply 0.
Both of the np solutions throw RuntimeWarnings and definitely do not come close to the correct values. Using numeric values such as 0 and -1 removes the RuntimeWarning, but the returned parameters are obviously flawed.
Other Software
In some reliability or lifetime analysis software (Minitab, lifelines), it is necessary to have two columns of data: one for the actual numbers and one to indicate whether the item has failed yet. For instance:
values = np.array(
[10197.8, 3349.0, 15318.6, 142.6, 0,
6976.5, 2590.7, 11351.7, 10177.0, 3738.4]
)
censored = np.array(
[True, True, True, True, False,
True, True, True, True, True]
)
I see no such paths within the documentation.
Old question but if anyone comes across this, there is a new survival analysis package for python, surpyval, that handles this, and other cases of censoring and truncation. For the example you provide above it would simply be:
import numpy as np
import surpyval as surv
values = np.array([10197.8, 3349.0, 15318.6, 142.6, 6976.5, 2590.7, 11351.7, 10177.0, 3738.4])
# 0 = failed, 1 = right censored
censored = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0])
model = surv.Weibull.fit(values, c=censored)
print(model.params)
(10584.005910580288, 1.038163987652635)
You might also be interested in the Weibull plot:
model.plot(plot_bounds=False)
Weibull plot
Full disclosure, I am the creator of surpyval
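For readers who would rather stay within SciPy, right-censored data can also be handled by maximising the censored log-likelihood directly: failures contribute the log-density, and censored items contribute the log-survival function at their censoring time. The sketch below is a hand-rolled illustration of that idea, not surpyval's implementation. The function name `fit_weibull_right_censored`, the starting values, and the choice of Nelder-Mead are assumptions made here.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import weibull_min

def fit_weibull_right_censored(times, censored):
    """Two-parameter Weibull MLE with right censoring (hypothetical helper)."""
    times = np.asarray(times, dtype=float)
    censored = np.asarray(censored, dtype=bool)

    def neg_log_likelihood(log_params):
        # Optimise on the log scale so beta and eta stay positive
        beta, eta = np.exp(log_params)
        ll_failed = weibull_min.logpdf(times[~censored], beta, scale=eta).sum()
        ll_censored = weibull_min.logsf(times[censored], beta, scale=eta).sum()
        return -(ll_failed + ll_censored)

    start = np.log([1.0, times.mean()])
    result = minimize(neg_log_likelihood, start, method="Nelder-Mead")
    beta, eta = np.exp(result.x)
    return beta, eta

values = np.array([10197.8, 3349.0, 15318.6, 142.6, 6976.5,
                   2590.7, 11351.7, 10177.0, 3738.4])
# 0 = failed, 1 = right censored (same convention as above)
censored = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0])

beta, eta = fit_weibull_right_censored(values, censored)
print(eta, beta)
```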
For a multiclass problem I use Scikit-Learn. I find very few examples of how to load a custom dataset with multiple classes. The sklearn.datasets.load_files method does not seem to be suitable, as files would need to be stored multiple times. I now have the following structure:
X => Python list with lists of features (in text).
y => Python list with lists of classes (in text).
How do I transform this to a structure Scikit-Learn can use in a classifier?
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
X = np.loadtxt('samples.csv', delimiter=",")
y_aux = np.loadtxt('targets.csv', delimiter=",")
y = MultiLabelBinarizer().fit_transform(y_aux)
Code explanation: Let's say you have all your features stored in a file called samples.csv and the multiclass labels in another file called targets.csv (they could be of course stored in the same file and you'd just need to split columns). For clarity in this example my files contain:
samples.csv
4.0,3.2,5.5
6.8,5.6,3.3
targets.csv
1,4 <-- sample one belongs to classes 1 and 4
2,3 <-- sample two belongs to classes 2,3
MultiLabelBinarizer encodes the output targets in such a way that y variable is ready to be fed into Multiclass classifiers. The output of the code is:
y = array([[1, 0, 0, 1],
[0, 1, 1, 0]])
meaning sample one belongs to classes 1 and 4 and sample two belongs to 2 and 3.
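The same encoding can be reproduced without any files, which may be closer to the in-memory lists described in the question. This is a minimal sketch using the example values from above; the variable names are illustrative.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Features and label sets held in plain Python lists
X = [[4.0, 3.2, 5.5],
     [6.8, 5.6, 3.3]]
y_raw = [[1, 4],
         [2, 3]]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(y_raw)             # one indicator column per class
labels_back = mlb.inverse_transform(y)   # recover the original label sets

print(mlb.classes_)
print(y)
print(labels_back)
```

Keeping the fitted `mlb` object around lets you map predicted indicator rows back to class labels with `inverse_transform`.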