I'm using PyTorch to train a model.
My validation_labels (ground truth labels) consists of the following values:
tensor([2, 0, 2, 2, 2, 0, 1, 1, 0, 2, 2, 0, 1, 2, 1, 2, 1, 1, 0, 1, 2, 2, 1, 2,
2, 2, 2, 1, 2, 1, 0, 2, 0, 2, 2, 2, 1, 2, 1, 1, 0, 0, 0, 0, 0, 2, 2, 2,
1, 1, 0, 2, 1, 0, 2, 2, 2, 2, 2, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 2, 2,
2, 2, 1, 2, 0, 2, 0, 1, 1, 2, 2, 0, 2, 2, 1, 1, 2, 0, 2, 2, 2, 2, 2, 0,
2, 2, 0, 0, 2, 1, 2, 2, 2, 2, 0, 0, 0, 1, 0, 2, 1, 2, 1, 2, 0, 2, 1, 2,
1, 0, 1, 2, 2, 2, 2, 0, 2, 1, 0, 2, 1, 2, 1, 1, 0, 1, 2, 2, 2, 2, 1, 0,
1, 1, 0, 2, 2, 1, 2, 2, 0, 1, 2, 0, 2, 0, 1, 1, 2, 0, 2, 0, 2, 2, 2, 2,
2, 1, 2, 2, 1, 0, 2, 1, 2, 2, 2, 2, 0, 2, 0, 0, 2, 1, 2, 0, 0, 2, 0, 2,
0, 0, 1, 1, 2, 2, 1, 2, 2, 1, 2, 2, 2, 0, 1, 2, 1, 2, 0, 0, 1, 1, 1, 2,
1, 2, 0, 0, 0, 0, 2, 2, 0, 0, 0, 2, 1, 0, 2, 1, 2, 2, 0, 2, 2, 0, 1, 0,
1, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0, 0, 2, 0, 1, 0, 1, 2, 1, 0, 1, 2,
2, 2, 1, 2, 2, 2, 1, 0, 1, 2, 2, 0, 2, 2, 2, 0, 1, 2, 0, 2, 2, 0, 0, 1,
1, 1, 1, 1, 1, 2, 0, 2, 1, 0, 2, 1, 0, 2, 2, 2, 2, 2, 1, 1, 0, 2, 2, 2,
2, 2, 0, 2, 0, 2, 2, 2, 1, 1, 0, 2, 1, 0, 0, 2, 0, 2, 1, 2, 0, 2, 2, 1,
1, 1, 2, 2, 2, 0, 1, 0, 1, 2, 2, 2, 2, 2, 0, 1, 2, 0, 0, 0, 2, 1, 2, 0,
2, 1, 2, 1, 2, 2, 2, 0, 0, 2, 2, 2, 2, 0, 2, 0, 0, 2, 2, 1, 1, 2, 2, 2,
2, 0, 2, 2, 0, 2, 0, 1, 1, 0, 2, 0, 2, 1, 2, 2, 2, 2, 1, 1, 2, 2, 0, 0,
2, 2, 2, 2, 2, 0, 2, 2, 0, 1, 2, 2, 2, 2, 0, 2, 2, 2, 2, 0, 2, 1, 2, 1,
2, 2, 2, 2, 1, 1, 1, 0, 0, 1, 1, 2, 2, 2, 2, 2, 1, 2, 1, 1, 1, 0, 0, 0,
0, 1, 1, 0, 0], device='mps:0')
But using the code below to build a DataLoader results in all of the validation_labels coming back as 2s.
validation_data = TensorDataset(validation_inputs, validation_masks, validation_labels)
validation_sampler = SequentialSampler(validation_data)
validation_dataloader = DataLoader(validation_data, sampler=validation_sampler, batch_size=batch_size)
for step, batch in enumerate(validation_dataloader):
    batch = tuple(t.to(device) for t in batch)
    eval_data, eval_masks, eval_labels = batch
    print(eval_labels)
The eval labels get printed as:
tensor([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2], device='mps:0')
tensor([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2], device='mps:0')
tensor([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2], device='mps:0')
tensor([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2], device='mps:0')
tensor([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2], device='mps:0')
tensor([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2], device='mps:0')
tensor([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2], device='mps:0')
tensor([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2], device='mps:0')
tensor([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2], device='mps:0')
tensor([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2], device='mps:0')
tensor([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2], device='mps:0')
tensor([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2], device='mps:0')
tensor([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2], device='mps:0')
tensor([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2], device='mps:0')
tensor([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2], device='mps:0')
Why are all the labels being changed to '2'? I'm not able to find out what is wrong with my code. Could someone tell me why this happens and what I should do about it?
This happened to me because the folder I was passing to the dataloader was the parent folder of the actual training data, i.e. the data lived in training/training. After removing the outer layer, the dataloader was able to read the labels correctly.
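For illustration, a minimal sketch assuming a torchvision ImageFolder-style pipeline; the directory names are hypothetical, not the asker's actual paths. Pointing at the parent folder makes its single subdirectory the only class, so every sample receives the same label:
from torchvision import datasets

# Wrong: 'training' contains only the subfolder 'training', so ImageFolder
# sees exactly one class and gives every image the same label.
# dataset = datasets.ImageFolder('training')

# Right: point at the folder whose immediate subdirectories are the classes.
dataset = datasets.ImageFolder('training/training')
print(dataset.class_to_idx)  # should list one entry per actual class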
I am trying to build an object classification model, but when I try to print out the classification report, it returns a ValueError.
ValueError: Classification metrics can't handle a mix of multiclass and continuous-multioutput targets
This is my current code:
train_size = int(len(df) * 0.7)
train_text = df['cleansed_text'][:train_size]
train_cat = df['category'][:train_size]
test_text = df['cleansed_text'][train_size:]
test_cat = df['category'][train_size:]
max_words = 2500
tokenize = text.Tokenizer(num_words=max_words, char_level=False)
tokenize.fit_on_texts(train_text)
x_train = tokenize.texts_to_matrix(train_text)
x_test = tokenize.texts_to_matrix(test_text)
encoder = LabelEncoder()
encoder.fit(train_cat)
y_train = encoder.transform(train_cat)
y_test = encoder.transform(test_cat)
num_classes = np.max(y_train) + 1
y_train = utils.to_categorical(y_train, num_classes)
y_test = utils.to_categorical(y_test, num_classes)
model = Sequential()
model.add(Dense(256, input_shape=(max_words,)))
model.add(Dropout(0.5))
model.add(Dense(256))
model.add(Dropout(0.5))
model.add(Activation('relu'))
model.add(Dense(num_classes, activation='softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.summary()

history = model.fit(x_train, y_train,
                    batch_size=32,
                    epochs=10,
                    verbose=1,
                    validation_split=0.1)
from sklearn.metrics import classification_report
y_test_arg = np.argmax(y_test, axis=1)
Y_pred = np.argmax(model.predict(x_test), axis=1)
print('Confusion Matrix')
print(confusion_matrix(y_test_arg, Y_pred))
print(classification_report(y_test_arg, y_pred, labels=[1,2,3,4,5]))
However, when I attempt to print the classification report, I run into this error:
21/21 [==============================] - 0s 2ms/step
Confusion Matrix
[[138 1 6 0 2]
[ 0 102 3 0 2]
[ 3 2 121 1 2]
[ 1 0 1 157 0]
[ 0 3 0 0 123]]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [56], in <cell line: 8>()
5 print('Confusion Matrix')
6 print(confusion_matrix(y_test_arg, Y_pred))
----> 8 print(classification_report(y_test_arg, y_pred, labels=[1,2,3,4,5]))
File ~\anaconda3\lib\site-packages\sklearn\metrics\_classification.py:2110, in classification_report(y_true, y_pred, labels, target_names, sample_weight, digits, output_dict, zero_division)
1998 def classification_report(
1999 y_true,
2000 y_pred,
(...)
2007 zero_division="warn",
2008 ):
2009 """Build a text report showing the main classification metrics.
2010
2011 Read more in the :ref:`User Guide <classification_report>`.
(...)
2107 <BLANKLINE>
2108 """
-> 2110 y_type, y_true, y_pred = _check_targets(y_true, y_pred)
2112 if labels is None:
2113 labels = unique_labels(y_true, y_pred)
File ~\anaconda3\lib\site-packages\sklearn\metrics\_classification.py:93, in _check_targets(y_true, y_pred)
90 y_type = {"multiclass"}
92 if len(y_type) > 1:
---> 93 raise ValueError(
94 "Classification metrics can't handle a mix of {0} and {1} targets".format(
95 type_true, type_pred
96 )
97 )
99 # We can't have more than one value on y_type => The set is no more needed
100 y_type = y_type.pop()
ValueError: Classification metrics can't handle a mix of multiclass and continuous-multioutput targets
y_test_arg
array([3, 3, 1, 0, 4, 1, 0, 4, 3, 4, 1, 1, 2, 2, 3, 0, 0, 4, 1, 3, 2, 0,
4, 1, 2, 3, 1, 2, 2, 4, 3, 2, 0, 2, 1, 4, 3, 2, 1, 1, 0, 3, 4, 4,
3, 1, 4, 2, 4, 3, 2, 2, 3, 1, 3, 2, 3, 4, 1, 3, 1, 0, 0, 1, 1, 1,
4, 3, 0, 0, 2, 2, 0, 2, 1, 3, 3, 4, 2, 3, 0, 3, 0, 4, 3, 3, 0, 1,
3, 3, 4, 3, 0, 2, 0, 1, 4, 1, 2, 0, 1, 2, 1, 2, 2, 0, 3, 3, 3, 4,
4, 3, 2, 1, 4, 3, 1, 0, 1, 2, 0, 3, 4, 0, 3, 2, 0, 1, 1, 1, 2, 1,
2, 1, 3, 1, 3, 2, 2, 0, 2, 4, 3, 4, 3, 0, 2, 4, 1, 1, 2, 1, 2, 3,
3, 2, 0, 4, 3, 2, 2, 1, 3, 2, 2, 0, 4, 4, 0, 4, 3, 3, 0, 2, 0, 4,
3, 4, 2, 1, 3, 0, 3, 1, 4, 4, 3, 2, 3, 0, 3, 0, 3, 3, 1, 1, 0, 4,
4, 0, 4, 0, 0, 3, 3, 2, 3, 4, 3, 4, 3, 3, 0, 0, 4, 3, 0, 4, 4, 2,
3, 0, 1, 1, 4, 2, 3, 3, 4, 0, 4, 1, 1, 2, 2, 0, 1, 3, 1, 1, 0, 3,
2, 4, 0, 3, 1, 4, 2, 2, 3, 3, 0, 0, 0, 0, 0, 1, 0, 2, 2, 4, 4, 1,
2, 1, 0, 2, 3, 3, 0, 4, 0, 4, 3, 0, 0, 2, 3, 3, 2, 2, 1, 1, 2, 0,
2, 2, 0, 4, 2, 2, 2, 2, 2, 1, 1, 4, 2, 3, 2, 3, 4, 3, 3, 3, 1, 4,
1, 4, 3, 4, 3, 3, 1, 1, 0, 1, 1, 2, 0, 3, 4, 4, 2, 0, 3, 0, 1, 3,
2, 1, 3, 3, 0, 2, 4, 4, 0, 0, 3, 2, 1, 3, 3, 2, 1, 4, 3, 1, 0, 2,
3, 2, 4, 1, 3, 2, 0, 1, 2, 1, 2, 3, 2, 0, 0, 2, 0, 4, 3, 0, 1, 0,
3, 3, 1, 4, 2, 4, 2, 2, 3, 3, 3, 0, 4, 1, 0, 3, 0, 3, 0, 4, 0, 0,
0, 0, 3, 3, 3, 0, 0, 1, 0, 0, 0, 3, 3, 3, 4, 0, 3, 3, 3, 0, 1, 4,
4, 4, 2, 0, 0, 4, 0, 4, 3, 3, 2, 2, 2, 3, 3, 2, 2, 4, 0, 3, 3, 3,
3, 0, 3, 0, 0, 0, 0, 3, 2, 3, 4, 4, 3, 4, 0, 1, 0, 3, 0, 4, 4, 2,
1, 0, 1, 0, 4, 2, 1, 2, 1, 1, 4, 0, 4, 4, 0, 2, 3, 1, 0, 2, 1, 0,
4, 3, 4, 2, 3, 2, 0, 2, 2, 0, 0, 0, 4, 2, 0, 2, 0, 1, 2, 3, 2, 2,
3, 1, 4, 4, 0, 4, 3, 0, 0, 2, 3, 4, 4, 4, 3, 1, 3, 2, 0, 2, 2, 1,
4, 0, 4, 3, 1, 1, 3, 0, 1, 4, 4, 3, 1, 0, 2, 2, 2, 4, 4, 0, 2, 0,
2, 2, 1, 3, 4, 0, 4, 1, 4, 4, 3, 2, 3, 3, 2, 1, 1, 0, 2, 2, 3, 0,
0, 4, 0, 4, 4, 3, 0, 2, 3, 0, 0, 3, 4, 3, 4, 1, 3, 3, 1, 0, 4, 3,
3, 2, 4, 0, 2, 3, 3, 2, 1, 4, 4, 4, 0, 3, 1, 1, 4, 0, 2, 4, 3, 3,
4, 4, 2, 0, 3, 1, 1, 3, 1, 4, 4, 0, 0, 0, 3, 3, 4, 3, 0, 4, 0, 0,
3, 0, 2, 0, 0, 4, 0, 4, 2, 4, 1, 2, 4, 1, 3, 2, 1, 0, 4, 0, 4, 1,
4, 3, 0, 0, 2, 1, 2, 3], dtype=int64)
y_pred
array([[2.6148611e-05, 1.2884392e-06, 8.0136197e-06, 9.9993646e-01,
2.8027451e-05],
[1.1888630e-08, 1.9621881e-07, 6.0117927e-08, 9.9999917e-01,
4.2087538e-07],
[2.4368815e-06, 9.9999702e-01, 2.0465748e-07, 9.2730332e-08,
2.5044619e-07],
...,
[8.7212893e-04, 9.9891293e-01, 7.5106349e-05, 7.0842376e-05,
6.8954141e-05],
[1.2511186e-02, 5.9731454e-05, 9.8512655e-01, 3.0246837e-04,
2.0000227e-03],
[5.9550672e-07, 7.1766672e-06, 2.0012515e-06, 9.9999011e-01,
1.1376539e-07]], dtype=float32)
Your problem is caused by passing continuous multi-output values, i.e. the raw softmax probabilities, where classification_report expects discrete class labels. These two lines produce the right integer values:
y_test_arg = np.argmax(y_test, axis=1)
Y_pred = np.argmax(model.predict(x_test), axis=1)
but the report is then called with the lowercase y_pred, which still holds the float probability matrix returned by model.predict. Pass Y_pred instead. Note also that LabelEncoder encodes the classes as 0 through 4, so labels=[1,2,3,4,5] silently drops class 0; use labels=[0,1,2,3,4].
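A minimal sketch of the corrected call, using the same variables as above:
from sklearn.metrics import classification_report, confusion_matrix
import numpy as np

y_test_arg = np.argmax(y_test, axis=1)             # integer labels 0-4
Y_pred = np.argmax(model.predict(x_test), axis=1)  # integer predictions 0-4

print(confusion_matrix(y_test_arg, Y_pred))
print(classification_report(y_test_arg, Y_pred, labels=[0, 1, 2, 3, 4]))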
I am using ImageDataGenerator (tensorflow version 2.5.0) to load a number of jpg files for a classification system. I have specified class_mode='categorical'. My images are originally RGB, but even though I am converting them to greyscale, I don't think that should matter. However, when I call train_set.classes, the data I get is not one-hot encoded; it is sparse integer labels. Here is my ImageDataGenerator call:
def preprocessing_function(image):
    # invert the rescaled (0-1) pixel values
    neg = 1 - image
    return neg
#image_path = sys.argv[1]
image_path = ''
train_datagen = ImageDataGenerator(
    rescale=1./255,
    shear_range=0.2,
    zoom_range=0.2,
    vertical_flip=True,
    width_shift_range=0.2,
    height_shift_range=0.2,
    horizontal_flip=True,
    preprocessing_function=preprocessing_function)

train_set = train_datagen.flow_from_directory(
    os.path.join(image_path, 'endo_jpg/endo_256_2021_08_05/Training'),
    target_size=(100,100),
    batch_size=batch,
    class_mode='categorical',
    color_mode='grayscale')
Upon calling the flow_from_directory method, I am returned what I expect:
Found 625 images belonging to 4 classes.
Calling train_set.classes, I am returned a long list of integers, not one hot encoded data:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3])
I can force the data to be one hot encoded by using:
train_set.classes = tensorflow.keras.utils.to_categorical(train_set.classes), but then I can't train with the data generator.
I think there is a problem with my specifying class_mode='categorical', but I have no idea why. I followed the example in the documentation, but calling categorical returns sparse labels.
Since you are using class_mode='categorical', you don't have to manually convert the labels to one-hot encoded vectors using to_categorical().
The classes attribute always stores the integer class index of each file, regardless of class_mode; the generator one-hot encodes the labels in the batches it yields.
Simply calling train_set[0] shows the images and the labels, and the printed labels are one-hot encoded.
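As a quick check (a sketch; the shapes follow the question's batch size and 4 classes):
images, labels = train_set[0]     # first batch yielded by the generator
print(train_set.classes[:5])      # integer indices, e.g. [0 0 0 0 0]
print(labels.shape)               # (batch, 4): one one-hot row per image
print(labels[0])                  # e.g. [0., 1., 0., 0.]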
I have a quick question about the numpy unique function. I want to return the unique values of each row:
import numpy as np
a = np.array([[3, 2, 3, 2, 1, 3, 1, 2, 1, 3, 1, 2, 2, 2, 3, 3],
[3, 2, 3, 2, 3, 3, 3, 3, 2, 2, 3, 1, 2, 1, 2, 1],
[3, 3, 3, 2, 3, 3, 3, 2, 2, 2, 3, 2, 2, 3, 1, 1]]) # a.shape is (3,16)
np.unique(a)
array([1, 2, 3]) # not what I want
np.unique(a,axis=1)
array([[1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3],
[2, 3, 1, 1, 2, 2, 3, 1, 2, 2, 3],
[2, 3, 2, 3, 2, 3, 2, 1, 1, 2, 3]]) # also not what I want, and I'm not even sure what it's doing
np.apply_along_axis(np.unique,1,a)
array([[1, 2, 3],
[1, 2, 3],
[1, 2, 3]]) # this is what I want
The problem is that I also want to use other features of np.unique, like returning index values. Can anyone help me get np.unique to work by itself?
You can loop over the rows and collect the unique values:
import numpy as np

a = np.array([[3, 2, 3, 2, 1, 3, 1, 2, 1, 3, 1, 2, 2, 2, 3, 3],
              [3, 2, 3, 2, 3, 3, 3, 3, 2, 2, 3, 1, 2, 1, 2, 1],
              [3, 3, 3, 2, 3, 3, 3, 2, 2, 2, 3, 2, 2, 3, 1, 1]])

arr = np.empty((0, 3), int)
for row in a:
    # append this row's unique values as a new row of the result
    arr = np.append(arr, np.array([np.unique(row)]), axis=0)
Output:
[[1 2 3]
[1 2 3]
[1 2 3]]
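Note that np.append copies the whole array on every iteration, so for larger inputs it is cheaper to collect the rows in a list and stack once, e.g. np.vstack([np.unique(row) for row in a]). Either way, this only works because every row here happens to have the same number of unique values.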
numpy will not be able to return a matrix with rows of different sizes. Your example has exactly 3 distinct values per row, which is what makes np.apply_along_axis work; if one row contained a 4, or held only 1s and 2s, it would fail.
To obtain what you are looking for, you will need a normal Python list as the result. You can build it using a list comprehension:
import numpy as np
a = np.array([[1, 2, 2, 2, 1, 1, 1, 2, 1, 2, 1, 2, 2, 2, 1, 1],
[3, 2, 3, 2, 3, 3, 3, 3, 2, 2, 3, 1, 2, 1, 2, 1],
[3, 3, 3, 2, 3, 3, 4, 2, 2, 2, 3, 2, 2, 3, 1, 1]])
r = [np.unique(row) for row in a]
print(r)
# [array([1, 2]), array([1, 2, 3]), array([1, 2, 3, 4])]

r = [np.unique(row, return_index=True) for row in a]
print(r)
# [(array([1, 2]), array([0, 1])),
#  (array([1, 2, 3]), array([11, 1, 0])),
#  (array([1, 2, 3, 4]), array([14, 3, 0, 6]))]
One thing you could do is build a mask of the values that are the first of their kind on each row. This can be done using numpy.
Here's one way to do it (hopefully, numpy experts could suggest something less convoluted):
np.sum(np.cumsum(np.cumsum(a == np.unique(a)[:, None, None], axis=2), axis=2) == 1, axis=0)
array([[1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
[1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0]])
Such a mask offers many processing options such as finding indices of the first occurrence on each line (using np.argwhere), erasing/assigning first or subsequent occurrences, and more.
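For example, a sketch of recovering the first-occurrence indices per row from the mask (using the same a as the previous answer):
mask = np.sum(np.cumsum(np.cumsum(a == np.unique(a)[:, None, None], axis=2), axis=2) == 1, axis=0)
first_idx = [np.argwhere(row).ravel() for row in mask]
# [array([0, 1]), array([ 0,  1, 11]), array([ 0,  3,  6, 14])]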
I have a DataFrame containing density values. I'd like to group by the 'hour' value, bin the densities, and add a new column to my original df, containing the bin number. This is failing, however:
df = pd.DataFrame({
'hours': np.random.randint(0, 24, 10000),
'density' : np.random.sample(10000)})
def func(df):
    """Calculate equal intervals of a series or array."""
    intervals = pysal.esda.mapclassify.Equal_Interval(df.density, 5)
    # yb is an ndarray containing the bin indices, 0 - 4 in this case
    return intervals.yb

df['bins'] = df.groupby(df.hours).transform(func)
Gives AssertionError: length of join_axes must not be equal to 0
If I just group the object and apply the interval function, it looks like this:
grp = df.groupby(df.hours).apply(func)
grp
Out[106]:
hours
0 [2, 4, 3, 4, 0, 4, 2, 2, 0, 1, 0, 0, 2, 2, 0, ...
1 [4, 1, 0, 4, 0, 2, 2, 3, 2, 3, 0, 3, 4, 3, 2, ...
2 [4, 1, 0, 2, 3, 4, 1, 1, 0, 3, 4, 4, 2, 4, 0, ...
3 [3, 0, 0, 4, 0, 0, 0, 1, 2, 2, 0, 2, 2, 2, 1, ...
4 [0, 1, 1, 2, 1, 3, 1, 3, 2, 2, 1, 4, 0, 4, 2, ...
5 [2, 0, 2, 1, 3, 1, 1, 0, 4, 4, 2, 1, 4, 1, 2, ...
6 [1, 2, 3, 3, 3, 2, 4, 1, 2, 1, 2, 0, 3, 2, 0, ...
7 [3, 0, 3, 1, 3, 1, 2, 1, 4, 2, 1, 2, 1, 1, 1, ...
8 [0, 1, 4, 3, 0, 1, 0, 0, 1, 0, 2, 1, 0, 1, 1, ...
9 [4, 2, 0, 4, 1, 3, 2, 3, 4, 1, 1, 4, 4, 4, 4, ...
10 [4, 4, 3, 3, 1, 2, 3, 0, 2, 4, 2, 4, 0, 2, 2, ...
11 [0, 1, 3, 0, 1, 1, 1, 1, 2, 1, 2, 0, 3, 3, 4, ...
12 [3, 1, 1, 0, 4, 4, 3, 0, 1, 2, 1, 1, 4, 2, 0, ...
13 [1, 1, 0, 2, 0, 1, 4, 1, 2, 2, 3, 1, 2, 0, 3, ...
14 [2, 4, 0, 2, 1, 2, 0, 4, 4, 2, 3, 4, 2, 1, 1, ...
15 [2, 4, 3, 4, 1, 0, 3, 1, 2, 0, 3, 4, 2, 2, 3, ...
16 [0, 4, 2, 3, 3, 4, 0, 3, 2, 0, 1, 0, 0, 2, 0, ...
17 [3, 1, 4, 4, 0, 4, 1, 0, 4, 3, 3, 2, 3, 1, 4, ...
18 [4, 3, 0, 2, 4, 2, 2, 0, 2, 2, 1, 2, 1, 0, 1, ...
19 [3, 0, 3, 1, 1, 0, 1, 1, 3, 3, 2, 3, 4, 0, 0, ...
20 [3, 0, 1, 4, 0, 0, 4, 2, 4, 2, 2, 0, 4, 0, 0, ...
21 [4, 2, 3, 3, 1, 2, 0, 4, 2, 0, 2, 2, 1, 2, 2, ...
22 [0, 4, 1, 1, 3, 1, 4, 1, 3, 4, 4, 0, 4, 4, 4, ...
23 [4, 1, 2, 0, 2, 0, 0, 0, 2, 3, 1, 1, 3, 0, 1, ...
dtype: object
Is there a standard way to join or merge values calculated from a grouped object, or should I be using transform differently?
Try the transform on the density column, like this:
df['bins'] = df.groupby(df.hours).density.transform(func)
Note: func needs to be changed to receive a Series as its argument instead of a DataFrame.
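A sketch of the adjusted function, assuming the same pysal API as in the question; transform now hands it one hour group's density values as a Series:
def func(series):
    """Bin one group's density values into 5 equal intervals; return the bin indices."""
    intervals = pysal.esda.mapclassify.Equal_Interval(series, 5)
    return intervals.yb

df['bins'] = df.groupby(df.hours).density.transform(func)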