I have clustered some data points twice and obtained four clusters (A=1, B=2, C=3, D=4) in each run. I want to assess the overall stability of the clustering, but also assess each cluster individually (cluster A from the first run (A1) vs. cluster A from the second run (A2), B1 vs. B2, C1 vs. C2, and D1 vs. D2).
For the overall stability I am using the adjusted Rand index (ARI) and have no problem. However, when I want to compare, e.g., A1 vs. A2, I don't really know how to proceed.
The clustering results are the following:
c1 <- c(1, 2, 3, 2, 1, 3, 4, 3, 2, 2, 3, 4, 3, 2, 1, 2, 3, 4, 3, 2, 1, 2, 3, 4, 2, 3, 2, 3, 2, 1, 3, 4, 4, 4, 4, 3, 2, 3, 2, 3, 1, 3, 2, 1, 2, 3, 4, 3, 2, 1, 4, 3, 2, 2, 2, 3, 4, 3, 3, 3, 2, 1, 1, 1, 2)
c2 <- c(1, 2, 4, 4, 1, 3, 4, 2, 2, 2, 3, 4, 1, 2, 1, 2, 3, 4, 3, 2, 1, 2, 2, 4, 2, 3, 2, 3, 2, 1, 3, 3, 4, 3, 4, 3, 2, 3, 2, 3, 1, 1, 1, 1, 2, 3, 4, 3, 2, 1, 4, 3, 2, 2, 2, 3, 4, 3, 3, 3, 2, 1, 1, 1, 2)
Is there any good strategy for comparing each pair of clusters (e.g., A1 vs. A2)?
Suggestions in R or Python syntax are welcome.
Thanks in advance!
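One possible per-cluster strategy, sketched below in Python (assuming scikit-learn is available and using toy versions of the label vectors): treat each cluster as the set of point indices assigned to it and compute the Jaccard index between the corresponding clusters of the two runs, alongside the overall ARI.
import numpy as np
from sklearn.metrics import adjusted_rand_score

# Toy label vectors; substitute the full c1/c2 from the question.
c1 = np.array([1, 2, 3, 2, 1, 3, 4, 3, 2, 2])
c2 = np.array([1, 2, 4, 4, 1, 3, 4, 2, 2, 2])

print("overall ARI:", adjusted_rand_score(c1, c2))

# Per-cluster stability: Jaccard index of the two index sets for cluster k.
for k in sorted(set(c1) | set(c2)):
    s1 = set(np.where(c1 == k)[0])
    s2 = set(np.where(c2 == k)[0])
    jac = len(s1 & s2) / len(s1 | s2) if (s1 | s2) else float("nan")
    print(f"cluster {k}: Jaccard = {jac:.2f}")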
I'm using PyTorch to train a model.
My validation_labels (ground-truth labels) tensor contains the following values:
tensor([2, 0, 2, 2, 2, 0, 1, 1, 0, 2, 2, 0, 1, 2, 1, 2, 1, 1, 0, 1, 2, 2, 1, 2,
2, 2, 2, 1, 2, 1, 0, 2, 0, 2, 2, 2, 1, 2, 1, 1, 0, 0, 0, 0, 0, 2, 2, 2,
1, 1, 0, 2, 1, 0, 2, 2, 2, 2, 2, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 2, 2,
2, 2, 1, 2, 0, 2, 0, 1, 1, 2, 2, 0, 2, 2, 1, 1, 2, 0, 2, 2, 2, 2, 2, 0,
2, 2, 0, 0, 2, 1, 2, 2, 2, 2, 0, 0, 0, 1, 0, 2, 1, 2, 1, 2, 0, 2, 1, 2,
1, 0, 1, 2, 2, 2, 2, 0, 2, 1, 0, 2, 1, 2, 1, 1, 0, 1, 2, 2, 2, 2, 1, 0,
1, 1, 0, 2, 2, 1, 2, 2, 0, 1, 2, 0, 2, 0, 1, 1, 2, 0, 2, 0, 2, 2, 2, 2,
2, 1, 2, 2, 1, 0, 2, 1, 2, 2, 2, 2, 0, 2, 0, 0, 2, 1, 2, 0, 0, 2, 0, 2,
0, 0, 1, 1, 2, 2, 1, 2, 2, 1, 2, 2, 2, 0, 1, 2, 1, 2, 0, 0, 1, 1, 1, 2,
1, 2, 0, 0, 0, 0, 2, 2, 0, 0, 0, 2, 1, 0, 2, 1, 2, 2, 0, 2, 2, 0, 1, 0,
1, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0, 0, 2, 0, 1, 0, 1, 2, 1, 0, 1, 2,
2, 2, 1, 2, 2, 2, 1, 0, 1, 2, 2, 0, 2, 2, 2, 0, 1, 2, 0, 2, 2, 0, 0, 1,
1, 1, 1, 1, 1, 2, 0, 2, 1, 0, 2, 1, 0, 2, 2, 2, 2, 2, 1, 1, 0, 2, 2, 2,
2, 2, 0, 2, 0, 2, 2, 2, 1, 1, 0, 2, 1, 0, 0, 2, 0, 2, 1, 2, 0, 2, 2, 1,
1, 1, 2, 2, 2, 0, 1, 0, 1, 2, 2, 2, 2, 2, 0, 1, 2, 0, 0, 0, 2, 1, 2, 0,
2, 1, 2, 1, 2, 2, 2, 0, 0, 2, 2, 2, 2, 0, 2, 0, 0, 2, 2, 1, 1, 2, 2, 2,
2, 0, 2, 2, 0, 2, 0, 1, 1, 0, 2, 0, 2, 1, 2, 2, 2, 2, 1, 1, 2, 2, 0, 0,
2, 2, 2, 2, 2, 0, 2, 2, 0, 1, 2, 2, 2, 2, 0, 2, 2, 2, 2, 0, 2, 1, 2, 1,
2, 2, 2, 2, 1, 1, 1, 0, 0, 1, 1, 2, 2, 2, 2, 2, 1, 2, 1, 1, 1, 0, 0, 0,
0, 1, 1, 0, 0], device='mps:0')
But using the code below to build a DataLoader results in all the validation labels being converted to 2s.
from torch.utils.data import DataLoader, SequentialSampler, TensorDataset

validation_data = TensorDataset(validation_inputs, validation_masks, validation_labels)
validation_sampler = SequentialSampler(validation_data)
validation_dataloader = DataLoader(validation_data, sampler=validation_sampler,
                                   batch_size=batch_size)

for step, batch in enumerate(validation_dataloader):
    batch = tuple(t.to(device) for t in batch)
    eval_data, eval_masks, eval_labels = batch
    print(eval_labels)
The eval labels get printed as:
tensor([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2], device='mps:0')
tensor([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2], device='mps:0')
tensor([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2], device='mps:0')
tensor([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2], device='mps:0')
tensor([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2], device='mps:0')
tensor([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2], device='mps:0')
tensor([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2], device='mps:0')
tensor([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2], device='mps:0')
tensor([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2], device='mps:0')
tensor([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2], device='mps:0')
tensor([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2], device='mps:0')
tensor([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2], device='mps:0')
tensor([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2], device='mps:0')
tensor([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2], device='mps:0')
tensor([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2], device='mps:0')
Why are all the labels being changed to 2? I can't figure out what is wrong with my code. Could someone tell me why this happens and what I should do about it?
This happened to me because the folder I was passing to the DataLoader was the parent folder of the actual training data, i.e. the data was in training/training. By removing the outer layer, the data loader was able to read the labels correctly.
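To illustrate that failure mode, here is a minimal sketch assuming torchvision's ImageFolder (the paths data and data/training are hypothetical stand-ins for your layout). ImageFolder derives one class label per immediate subdirectory, so pointing it at the parent makes the single training folder the only class and every sample gets the same label.
from torchvision import datasets

# Layout on disk: data/training/<class_a>/..., data/training/<class_b>/...
# Wrong: "training" becomes the only class, so all labels are identical.
# ds = datasets.ImageFolder("data")
# Right: the children of the given directory are the class folders.
ds = datasets.ImageFolder("data/training")
print(ds.classes)  # should list the real class names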
I have the following list in python3:
a = [1, 1, 1, 2, 2, 3, 3, 3, 1, 1, 1, 2, 2, 2, 4, 4, 4, 3, 3, 3, 3]
I want to have something like this as an output:
b = [1, 2, 3, 1, 2, 4, 3]
How can I do that? I have tried a = set(a) and other de-duplication methods, but they remove all duplicates, not just the consecutive ones.
If you are willing to use a module, you can use itertools.groupby:
from itertools import groupby
a = [1, 1, 1, 2, 2, 3, 3, 3, 1, 1, 1, 2, 2, 2, 4, 4, 4, 3, 3, 3, 3]
output = [k for k, _ in groupby(a)]
print(output) # [1, 2, 3, 1, 2, 4, 3]
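Note that groupby groups only consecutive equal values, which is exactly what is needed here, and it streams through the input without building any intermediate lists.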
This would work:
a = [1, 1, 1, 2, 2, 3, 3, 3, 1, 1, 1, 2, 2, 2, 4, 4, 4, 3, 3, 3, 3]
b = [a[0]]
for k in a:
    if k != b[-1]:
        b.append(k)
print(b)  # [1, 2, 3, 1, 2, 4, 3]
If you are comfortable with C, you may like this style:
a = [1, 1, 1, 2, 2, 3, 3, 3, 1, 1, 1, 2, 2, 2, 4, 4, 4, 3, 3, 3, 3]
b = len(a)
i = 0
last_one = None
while i < b:
    if a[i] == last_one:
        del a[i]  # remove the repeated value in place
        b -= 1
    else:
        last_one = a[i]
        i += 1
print(a)
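Note that del a[i] shifts every element after position i, so this in-place version is O(n²) in the worst case, whereas the two approaches above run in linear time.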
I have a problem that I'm trying to solve with itertools.product. It is a permutations problem, but the values are not unique. Example:
list = [1,1,2,2,3,3]
#result should be
[1,2,1,2,3,3], [2,2,1,1,3,3], .....
I tried using set(itertools.permutations(list)), which is the straightforward answer, but the processing time is far too long for lists with many distinct values or many elements.
I also tried x = itertools.product(set(list), repeat=len(list)) and then removing from x the items that don't match the original list's value counts (i.e. the generated lists must have two 1s, two 2s, and two 3s). This solution is quicker, but it throws a MemoryError for large inputs because it stores the whole output in memory before processing it.
I also tried looping over the product result directly (i.e. for i in itertools.product(set(list), repeat=len(list)), keeping the wanted iterations and throwing the rest away). This solves the memory problem, but it is almost as slow as the first approach; the code can run for hours.
Does anyone have a suggestion for a more efficient way to solve this problem?
Here's how I might approach the problem: take each unique starting value, then recursively generate all unique permutations of the values with that starting value removed. This avoids generating any duplicates; it only stores the set of values already tried at each position, so the worst-case memory use for an n-element list is O(n²).
def unique_permutations(values):
    # Base case: one (empty) permutation of an empty list.
    if not values:
        yield []
        return
    seen = set()
    for idx, first in enumerate(values):
        # Skip starting values already used at this position.
        if first in seen:
            continue
        seen.add(first)
        rest = values[:idx] + values[idx + 1:]
        for perm in unique_permutations(rest):
            yield [first] + perm

values = [1, 1, 2, 2, 3, 3]
for perm in unique_permutations(values):
    print(perm)
Output:
[1, 1, 2, 2, 3, 3]
[1, 1, 2, 3, 2, 3]
[1, 1, 2, 3, 3, 2]
[1, 1, 3, 2, 2, 3]
[1, 1, 3, 2, 3, 2]
[1, 1, 3, 3, 2, 2]
[1, 2, 1, 2, 3, 3]
[1, 2, 1, 3, 2, 3]
[1, 2, 1, 3, 3, 2]
[1, 2, 2, 1, 3, 3]
[1, 2, 2, 3, 1, 3]
[1, 2, 2, 3, 3, 1]
[1, 2, 3, 1, 2, 3]
[1, 2, 3, 1, 3, 2]
[1, 2, 3, 2, 1, 3]
[1, 2, 3, 2, 3, 1]
[1, 2, 3, 3, 1, 2]
[1, 2, 3, 3, 2, 1]
[1, 3, 1, 2, 2, 3]
[1, 3, 1, 2, 3, 2]
[1, 3, 1, 3, 2, 2]
[1, 3, 2, 1, 2, 3]
[1, 3, 2, 1, 3, 2]
[1, 3, 2, 2, 1, 3]
[1, 3, 2, 2, 3, 1]
[1, 3, 2, 3, 1, 2]
[1, 3, 2, 3, 2, 1]
[1, 3, 3, 1, 2, 2]
[1, 3, 3, 2, 1, 2]
[1, 3, 3, 2, 2, 1]
[2, 1, 1, 2, 3, 3]
[2, 1, 1, 3, 2, 3]
[2, 1, 1, 3, 3, 2]
[2, 1, 2, 1, 3, 3]
[2, 1, 2, 3, 1, 3]
[2, 1, 2, 3, 3, 1]
[2, 1, 3, 1, 2, 3]
[2, 1, 3, 1, 3, 2]
[2, 1, 3, 2, 1, 3]
[2, 1, 3, 2, 3, 1]
[2, 1, 3, 3, 1, 2]
[2, 1, 3, 3, 2, 1]
[2, 2, 1, 1, 3, 3]
[2, 2, 1, 3, 1, 3]
[2, 2, 1, 3, 3, 1]
[2, 2, 3, 1, 1, 3]
[2, 2, 3, 1, 3, 1]
[2, 2, 3, 3, 1, 1]
[2, 3, 1, 1, 2, 3]
[2, 3, 1, 1, 3, 2]
[2, 3, 1, 2, 1, 3]
[2, 3, 1, 2, 3, 1]
[2, 3, 1, 3, 1, 2]
[2, 3, 1, 3, 2, 1]
[2, 3, 2, 1, 1, 3]
[2, 3, 2, 1, 3, 1]
[2, 3, 2, 3, 1, 1]
[2, 3, 3, 1, 1, 2]
[2, 3, 3, 1, 2, 1]
[2, 3, 3, 2, 1, 1]
[3, 1, 1, 2, 2, 3]
[3, 1, 1, 2, 3, 2]
[3, 1, 1, 3, 2, 2]
[3, 1, 2, 1, 2, 3]
[3, 1, 2, 1, 3, 2]
[3, 1, 2, 2, 1, 3]
[3, 1, 2, 2, 3, 1]
[3, 1, 2, 3, 1, 2]
[3, 1, 2, 3, 2, 1]
[3, 1, 3, 1, 2, 2]
[3, 1, 3, 2, 1, 2]
[3, 1, 3, 2, 2, 1]
[3, 2, 1, 1, 2, 3]
[3, 2, 1, 1, 3, 2]
[3, 2, 1, 2, 1, 3]
[3, 2, 1, 2, 3, 1]
[3, 2, 1, 3, 1, 2]
[3, 2, 1, 3, 2, 1]
[3, 2, 2, 1, 1, 3]
[3, 2, 2, 1, 3, 1]
[3, 2, 2, 3, 1, 1]
[3, 2, 3, 1, 1, 2]
[3, 2, 3, 1, 2, 1]
[3, 2, 3, 2, 1, 1]
[3, 3, 1, 1, 2, 2]
[3, 3, 1, 2, 1, 2]
[3, 3, 1, 2, 2, 1]
[3, 3, 2, 1, 1, 2]
[3, 3, 2, 1, 2, 1]
[3, 3, 2, 2, 1, 1]
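As a follow-up: if a third-party dependency is acceptable, the more-itertools package provides distinct_permutations, which implements essentially this idea (a sketch, assuming pip install more-itertools):
from more_itertools import distinct_permutations

values = [1, 1, 2, 2, 3, 3]
for perm in distinct_permutations(values):
    print(perm)  # 90 tuples in total; duplicates are never generated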
I have multidimensional data in a pandas data frame with one variable indicating the class. For example, here is my attempt at a poor man's heatmap scatter plot:
import pandas as pd
import random
import numpy as np
import matplotlib.pyplot as plt
nrows = 1000
df = pd.DataFrame([[random.random(), random.random(), random.randint(0, 1)]
                   for _ in range(nrows)],
                  columns=list("ABC"))
# Bin A and B by hand, then compute the share of C == 1 per 2D bin.
bins = np.linspace(0, 1, 20)
df["Abin"] = [bins[i - 1] for i in np.digitize(df.A, bins)]
df["Bbin"] = [bins[i - 1] for i in np.digitize(df.B, bins)]
g = df.loc[:, ["Abin", "Bbin", "C"]].groupby(["Abin", "Bbin"])
data = g.agg(["sum", "count"])
data.reset_index(inplace=True)
data["classratio"] = data[("C", "sum")] / data[("C", "count")]
plt.scatter(data.Abin, data.Bbin, c=data.classratio,
            cmap="RdYlGn_r", marker="s")
I'd like to plot class densities over binned features. Here I used np.digitize for the binning and a convoluted hand-rolled density calculation to draw the heatmap.
Surely this can be done more compactly with pandas (pivot?). Do you know a neat way to bin the two features (for example, 10 bins on the interval 0...1) and then plot a class-density heatmap where the color indicates the ratio of 1s to total rows within each 2D bin?
Yep, it can be done in a very concise way using the built-in cut function:
In [65]:
nrows = 1000
df = pd.DataFrame([[random.random(), random.random(), random.randint(0, 1)]
                   for _ in range(nrows)],
                  columns=list("ABC"))
In [66]:
#This does the trick.
pd.crosstab(np.array(pd.cut(df.A, 20)), np.array(pd.cut(df.B, 20))).values
Out[66]:
array([[2, 2, 2, 2, 7, 2, 3, 5, 1, 4, 2, 2, 1, 3, 2, 1, 7, 2, 4, 2],
[1, 2, 4, 2, 0, 3, 3, 3, 1, 1, 2, 1, 4, 3, 2, 1, 1, 2, 2, 1],
[0, 4, 1, 3, 1, 3, 2, 5, 2, 3, 1, 1, 1, 4, 2, 3, 6, 5, 2, 2],
[5, 2, 3, 2, 2, 1, 3, 2, 4, 0, 3, 2, 0, 4, 3, 2, 1, 3, 1, 3],
[2, 2, 4, 1, 3, 2, 2, 4, 1, 4, 3, 5, 5, 2, 3, 3, 0, 2, 4, 0],
[2, 3, 3, 5, 2, 0, 5, 3, 2, 3, 1, 2, 5, 4, 4, 3, 4, 3, 6, 4],
[3, 2, 2, 4, 3, 3, 2, 0, 0, 4, 3, 2, 2, 5, 4, 0, 1, 2, 2, 3],
[0, 0, 4, 4, 3, 2, 4, 6, 4, 2, 0, 5, 2, 2, 1, 3, 4, 4, 3, 2],
[3, 2, 2, 3, 4, 2, 1, 3, 1, 3, 4, 2, 4, 3, 2, 3, 2, 3, 4, 4],
[0, 1, 1, 4, 1, 4, 3, 0, 1, 1, 1, 2, 6, 4, 3, 5, 3, 3, 1, 4],
[2, 2, 4, 1, 3, 4, 1, 2, 1, 3, 3, 3, 1, 2, 1, 5, 2, 1, 4, 3],
[0, 0, 0, 4, 2, 0, 2, 3, 2, 2, 2, 4, 4, 2, 3, 2, 1, 2, 1, 0],
[3, 3, 0, 3, 1, 5, 1, 1, 2, 5, 6, 5, 0, 0, 3, 2, 1, 5, 7, 2],
[3, 3, 2, 1, 2, 2, 2, 2, 4, 0, 1, 3, 3, 1, 5, 6, 1, 3, 2, 2],
[3, 0, 3, 4, 3, 2, 1, 4, 2, 3, 4, 0, 5, 3, 2, 2, 4, 3, 0, 2],
[0, 3, 2, 2, 1, 5, 1, 4, 3, 1, 2, 2, 3, 5, 1, 2, 2, 2, 1, 2],
[1, 3, 2, 1, 1, 4, 4, 3, 2, 2, 5, 5, 1, 0, 1, 0, 4, 3, 3, 2],
[2, 2, 2, 1, 1, 3, 1, 6, 5, 2, 5, 2, 3, 4, 2, 2, 1, 1, 4, 0],
[3, 3, 4, 7, 0, 2, 6, 4, 1, 3, 4, 4, 1, 4, 1, 1, 2, 1, 3, 2],
[3, 6, 3, 4, 1, 3, 1, 3, 3, 1, 6, 2, 2, 2, 1, 1, 4, 4, 0, 4]])
In [67]:
abins = np.linspace(df.A.min(), df.A.max(), 21)
bbins = np.linspace(df.B.min(), df.B.max(), 21)
# Count the C == 1 rows per 2D bin and divide by the total count per bin.
Z = pd.crosstab(np.array(pd.cut(df.loc[df.C == 1, 'A'], abins)),
                np.array(pd.cut(df.loc[df.C == 1, 'B'], bbins))).div(
    pd.crosstab(np.array(pd.cut(df.A, abins)),
                np.array(pd.cut(df.B, bbins)))).values
Z = np.ma.masked_invalid(Z)  # mask empty bins (0/0 gives NaN)
x = np.linspace(df.A.min(), df.A.max(), 20)
y = np.linspace(df.B.min(), df.B.max(), 20)
X, Y = np.meshgrid(x, y)
# Two alternative renderings of the same grid:
plt.contourf(X, Y, Z, vmin=0, vmax=1)
plt.colorbar()
plt.pcolormesh(X, Y, Z, vmin=0, vmax=1)
plt.colorbar()
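For the "more compactly with pandas" part of the question, here is a minimal sketch (column names follow the example df above; observed=False is an assumption so that empty bins are kept): bin with pd.cut, take the mean of the 0/1 class per 2D bin, which is exactly the ratio of 1s to total rows, and unstack to a grid.
# Mean of the 0/1 class per 2D bin == ratio of 1s to total rows in the bin.
grid = (df.assign(Abin=pd.cut(df.A, 20), Bbin=pd.cut(df.B, 20))
          .groupby(["Abin", "Bbin"], observed=False)["C"].mean()
          .unstack())
plt.imshow(grid.T, origin="lower", extent=[0, 1, 0, 1],
           cmap="RdYlGn_r", vmin=0, vmax=1)
plt.colorbar(label="ratio of class 1")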