Show pictures which represent kmeans cluster center (Scikit learn)

Show pictures which represent kmeans cluster center (Scikit learn) - python

I have a dataset which was clustered by kmeans. A Friend told me that i can show the pictures which represent each cluster center. He gave's me this short example code:
for i in xrange(len(np.unique(labels))):
this_cluster = np.where(labels == i)[0]
fig, ax = plt.subplots(len(this_cluster))
for im in this_cluster:
ax.imshow(images[im])
I've tried this but it's not working...for e.g I have a small dataset which contains 20 pics. Kmeans returns 50 Centers for this 20 pics. So my np.unique(labels) with (labels = kmeans.labels_?!) is equal to 50...so the "i" runs from 0 to 49...my first "this_cluster" looks like this one:
[ 4 8 18 19 35 37 50 135 140 146 156 214 371 506 563
586 594 887 916 989 993 1021 1061 1105 1121 1128 1405 1409 1458 1466
1481 1484 1505 1572 1573 1620 1784 1817 1835 1854 1945 1955 2004 2006 2054
2135 2204 2245 2319 2321 2343 2391 2410 2414 2486 2502 2530 2594 2624 2629
2825 2828 2833 2911 3017 3097 3245 3246 3298 3347 3493 3568 3627 3677 3701
3789 3866 3941 3944 3969 4022 4115 4214 4215 4432 4527 4559 4594 4645 4668
4699 4785 4797 4802 4807 4831 4892 4905 4921 4929 4932 5076 5178 5233 5249
5318 5463 5508 5571 5621 5644 5661 5678 5690 5727 5736 5737 5755 5777 5961
6088 6089 6107 6197 6353 6487 6500 6515 6565 6575 6601 6706 6749]
so if the next for begans it breaks by i=4 because there are only 20 pictures and images[im] with im>20 will give me a out of bounds...i think "this_clusters" are the Descriptors taken from the dataset which are compute by kmeans and set to cluster 0...so this can't be right?! or am i on the wrong way. Maybe someone could help me.
EDIT*:
create sets
X_train_pos, X_test_pos, X_dataset_train_pos, X_dataset_test_pos = train_test_split(X_desc_pos, dataset_pos, test_size=0.5)
X_train_neg, X_test_neg, X_dataset_train_neg, X_dataset_test_neg = train_test_split(X_desc_neg, dataset_neg, test_size=0.5)
# merge list of array descriptor into descriptor list
x1 = numpy.vstack(X_train_pos)
x2 = numpy.vstack(X_train_neg)
# compute cluster centers
kmeans, n_clusters = dataset_module.create_center_data(numpy.vstack((x1,x2)),numpy.vstack((X_dataset_train_pos,X_dataset_train_neg)))
compute kmeans
def create_center_data(data,dataset):
n_clusters = len(data)
n_clusters = math.sqrt(n_clusters/2)
n_clusters = int(n_clusters)
kmeans = KMeans(init='k-means++', n_clusters=n_clusters, n_init=1)
kmeans.fit(data)
numpy.set_printoptions(threshold=numpy.nan)
labels = kmeans.labels_
for i in xrange(len(numpy.unique(labels))):
this_cluster = numpy.where(labels == i)[0]
fig, ax = plt.subplots(len(this_cluster))
for im in this_cluster:
pic = open(dataset[im], "rb")
ax.imshow(pic)
return kmeans, n_clusters
data looks like:
[[ 36. 1. 9. ..., 0. 0. 0.]
[ 0. 0. 1. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 1.]
...,
[ 49. 26. 0. ..., 12. 4. 5.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 3. 8. ..., 0. 0. 3.]]
data = all descriptors of the 20 pictures...
dataset is a numpy array with paths to pictures
regards
Linda

If you cluster SIFT descriptors, your cluster means will look like sift descriptors, not like images.
I believe you were thinking of EigenFaces, but that has little to do with k-means.

Related

Python - Plot linear percentage graph

I have this numpy array:
[
[ 0 0 0 0 0 0 2 0 2 0 0 1 26 0]
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 4]
[21477 61607 21999 17913 22470 32390 11987 41977 81676 20668 17997 15278 46281 19884]
[ 5059 13248 5498 3866 2144 6161 2361 8734 16914 3724 4614 3607 11305 2880]
[ 282 1580 324 595 218 525 150 942 187 232 430 343 524 189]
[ 1317 6416 1559 882 599 2520 525 2560 19197 729 1391 1727 2044 1198]
]
I've just created logarithm heatmap which works as intended. However I would like to create another heat map that would represent linear scale across rows, and shows each position in matrix corresponding percentage value, while sum of row would give 100%. Without using seaborn or pandas
Something like this:

Here you go:
import matplotlib.pyplot as plt
import numpy as np
a = np.array([[0,0,0,0,0,0,2,0,2,0,0,1,26,0],
[0,0,0,0,0,0,0,0,0,0,0,0,0,4],
[21477,61607,21999,17913,22470,32390,11987,41977,81676,20668,17997,15278,46281,19884],
[5059,13248,5498,3866,2144,6161,2361,8734,16914,3724,4614,3607,11305,2880],
[282,1580,324,595,218,525,150,942,187,232,430,343,524,189],
[1317,6416,1559,882,599,2520,525,2560,19197,729,1391,1727,2044,1198]])
# normalize
normalized_a = a/np.sum(a,axis=1)[:,None]
# plot
plt.imshow(normalized_a)
plt.show()

Padas, how do I create a dataset where the column is a multidimensional array?

This is what my data looks like
print(data)
>
array([[ 0.369 , -0.3396 , 0.1017 , ..., 0.2164 , -0.11163, -0.6025 ],
[ 0.548 , -0.2668 , -0.1425 , ..., -0.3198 , -0.599 , 0.04703],
[ 0.761 , -0.2515 , 0.02998, ..., 0.04663, -0.3276 , -0.1771 ],
...,
[ 0.2148 , -0.492 , -0.03586, ..., 0.1157 , -0.299 , -0.12 ],
[ 0.775 , -0.2622 , -0.1372 , ..., 0.356 , -0.2673 , -0.1897 ],
[ 0.775 , -0.2622 , -0.1372 , ..., 0.356 , -0.2673 , -0.1897 ]],
dtype=float16)
I am trying to convert this to a column in pandas using this
dataset = pd.DataFrame(data, index=[0])
print(dataset)
But I get this error
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py in create_block_manager_from_blocks(blocks, axes)
1652
-> 1653 mgr = BlockManager(blocks, axes)
1654 mgr._consolidate_inplace()
7 frames
ValueError: Shape of passed values is (267900, 768), indices imply (1, 768)
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py in construction_error(tot_items, block_shape, axes, e)
1689 raise ValueError("Empty data passed with indices specified.")
1690 raise ValueError("Shape of passed values is {0}, indices imply {1}".format(
-> 1691 passed, implied))
1692
1693
ValueError: Shape of passed values is (267900, 768), indices imply (1, 768)
It looks like the tricky part is having a whole array as a row entry.
There was a suggestion of
"remove index dataset = pd.DataFrame(data) "
However , this does not give the desired result. Here's what the result looks like
dataset = pd.DataFrame(embeds16[:,0])
dataset.head()
0 1 2 3 4 5 6 7 8 9 ... 758 759 760 761 762 763 764 765 766 767
0 0.368896 -0.339600 0.101685 0.679199 -0.201904 -0.247192 -0.032776 -0.057098 0.287354 -0.356689 ... 0.064453 0.548340 -0.047729 -0.615723 -0.225464 -0.071106 -0.254395 0.216431 -0.111633 -0.602539
1 0.547852 -0.266846 -0.142456 1.327148 -0.135254 -0.376953 -0.221069 -0.273926 -0.099609 -0.146118 ... 0.138184 0.446777 -0.577637 0.051300 0.187378 0.171021 0.079163 -0.319824 -0.599121 0.047028
2 0.761230 -0.251465 0.029984 1.008789 -0.311279 -0.419922 -0.015869 -0.019196 0.016174 -0.284424 ... 0.152100 0.452881 -0.265381 -0.272949 0.029831 0.002472 0.186646 0.046631 -0.327637 -0.177124
3 0.690918 -0.374756 -0.008820 0.869141 -0.496582 -0.546875 0.060028 0.139893 -0.032471 -0.120361 ... 0.040314 0.391113 -0.420898 -0.342285 0.191650 0.350830 0.083130 0.028137 -0.488525 -0.157349
4 0.583008 -0.342529 -0.073608 0.683105 -0.071777 -0.390137 -0.174316 0.154541 0.170410 -0.184692 ... 0.326416 0.450928 0.083923 -0.331299 -0.207520
I am looking to have the entire array in a single column, not spread over multiple

Do you mean
pd.Series(a.tolist())
Update
pd.Series([x for x in a])

Python numpy : Matrix Inverses give unprecise results when multiplied

Alright, so I have 3 numpy matrices :
m1 = [[ 3 2 2 ... 2 2 3]
[ 3 2 2 ... 3 3 2]
[500 501 502 ... 625 626 627]
...
[623 624 625 ... 748 749 750]
[624 625 626 ... 749 750 751]
[625 626 627 ... 750 751 752]]
m2 = [[ 3 2 500 ... 623 624 625]
[ 3 2 500 ... 623 624 625]
[ 2 3 500 ... 623 624 625]
...
[ 2 2 500 ... 623 624 625]
[ 2 2 500 ... 623 624 625]
[ 3 2 500 ... 623 624 625]]
m3 = [[ 813 827 160500 ... 199983 200304 200625]
[ 830 843 164000 ... 204344 204672 205000]
[ 181317 185400 36064000 ... 44935744 45007872 45080000]
...
[ 221046 225867 43936000 ... 54744256 54832128 54920000]
[ 221369 226196 44000000 ... 54824000 54912000 55000000]
[ 221692 226525 44064000 ... 54903744 54991872 55080000]]
m1, m2 and m3 are very large square matrices (those examples are 128x128, but they can go up to 2048x2048). Also m1*m2=m3.
My goal is to obtain m2 by using only m1 and m3. Someone told me this was possible, as m1*m2=m3 implies that (m1**-1) * m3 = m2 (I believe it was that, please correct me if I'm wrong) ; so I calculated the inverse of m1 :
m1**-1 = [[ 7.70884284e-01 -8.13188394e-01 -1.65131146e+13 ... -2.49697170e+12
-7.70160676e+12 -4.13395320e+13]
[-3.38144598e-01 2.54532610e-01 1.01286404e+13 ... -3.64296085e+11
2.60327813e+12 2.41783491e+13]
[ 1.77721050e-01 -3.54566231e-01 -5.00564604e+12 ... 5.82415184e+10
-5.98354744e+11 -1.29817153e+13]
...
[-6.56772812e-02 1.54498025e-01 3.21826474e+12 ... 2.61432526e+11
1.14203762e+12 3.61036457e+12]
[ 5.82732587e-03 -3.44252762e-02 -4.79430664e+11 ... 5.10855381e+11
-1.07679881e+11 -1.71485373e+12]
[ 6.55360708e-02 -8.24446025e-02 -1.19618881e+12 ... 4.45713678e+11
-3.48073716e+11 -4.89344092e+12]]
The result looked rather messy so I ran a test and multiplied m1**-1 and m1 to see if it worked :
(m1**-1)*m1 = [[-125.296875 , -117.34375 , -117.390625 , ..., -139.15625 ,
-155.203125 , -147.25 ],
[ 483.1640625 , 483.953125 , 482.7421875 , ..., 603.796875 ,
590.5859375 , 593.375 ],
[-523.22851562, -522.36328125, -523.49804688, ..., -633.07421875,
-635.20898438, -637.34375 ],
...,
[ 10.58691406, 11.68945312, 10.29199219, ..., 14.40429688,
13.00683594, 11.609375 ],
[ -5.32177734, -5.47949219, -4.63720703, ..., -5.28613281,
-5.31884766, -5.6015625 ],
[ -4.93554688, -3.58984375, -3.24414062, ..., -8.72265625,
-5.37695312, -8.03125 ]]
The result is different from the one expected (identity matrix). My guess is that m1 is too big, causing numerical imprecision. But if that previous calculation to get an identity matrix doesn't work properly, then (m1**-1)*m3 surely won't (and it doesn't).
But I really can't decrease the matrix sizes for m1, m2 and m3 and in fact I'd like it to work with even bigger sizes (as said before, max size would be 2048x2048).
Would there be any way to be more precise with such calculations ? Is there an alternative that could work for bigger matrices ?

You are right, inverting a large matrix can be inefficient and numerically unstable. Luckily, there are methods in linear algebra that solve this problem without requiring an inverse.
In this case, m2 = np.linalg.solve(m1, m3) works.

how to subset pandas dataframe with list

I have a pandas dataframe like this.
order_id latitude longitude
0 519 19.119677 72.905081
1 520 19.138250 72.913190
2 521 19.138245 72.913183
3 523 19.117662 72.905484
4 524 19.137793 72.913088
5 525 19.119372 72.893768
6 526 19.116275 72.892951
7 527 19.133430 72.913268
8 528 19.136800 72.917185
9 529 19.118284 72.901114
10 530 19.127193 72.914269
11 531 19.114269 72.904039
12 532 19.136292 72.913941
13 533 19.119075 72.895115
14 534 19.119677 72.905081
15 535 19.119677 72.905081
And one list
DB
Out[658]:
[['523'],
['526', '533'],
['527', '528', '532', '535'],
['530', '519'],
['529', '531', '525', '534'],
['520', '521', '524']]
Now I want to subset dataframe on list elements. There are 6 elements in a list and every element have a sublist of order_id. So, for every sub element I want corresponding latitude and longitude. And then I want to calculate haversine distance between each order_id location:
DB[2]
['527', '528', '532', '535']
Then I want to subset on main dataframe for latitude and longitude pairs. So it should return me an array like this:
array([[ 19.11824057, 72.8939447 ],
[ 19.1355074 , 72.9147978 ],
[ 19.11917348, 72.90518167],
[ 19.127193 , 72.914269 ]])
(Just an example not a correct lat long pairs).
I am doing following:
db_lat = []
db_long = []
for i in range(len(DB)):
l = len(DB[i])
for j in range(l):
db_lat.append(tsp_data_unique.latitude[tsp_data_unique['order_id'] ==
''.join(DB[i][j])])
db_long.append(tsp_data_unique.longitude[tsp_data_unique['order_id']
== ''.join(DB[i][j])])
But it gives me a list of all the lat and long present in DB. Here I am not able to distinguish which lat and long belong to which DB elements. So, for every DB elements (6 in my case) I want 6 arrays of lat and long. Please help.

First of all I would convert your int column to str to compare the dataframe with the values of the list:
df['order_id'] = df['order_id'].apply(str)
and then set the index on order_id:
df = df.set_index('order_id')
Then you can do something like:
pairs = df.loc[DB[2]].values
obtaining:
array([[ 19.13343 , 72.913268],
[ 19.1368 , 72.917185],
[ 19.136292, 72.913941],
[ 19.119677, 72.905081]])
EDIT:
Iterating over your list you can then:
In [93]: for i in range(len(DB)):
....: p = df.loc[DB[i]].values
....: print p
....:
[[ 19.117662 72.905484]]
[[ 19.116275 72.892951]
[ 19.119075 72.895115]]
[[ 19.13343 72.913268]
[ 19.1368 72.917185]
[ 19.136292 72.913941]
[ 19.119677 72.905081]]
[[ 19.127193 72.914269]
[ 19.119677 72.905081]]
[[ 19.118284 72.901114]
[ 19.114269 72.904039]
[ 19.119372 72.893768]
[ 19.119677 72.905081]]
[[ 19.13825 72.91319 ]
[ 19.138245 72.913183]
[ 19.137793 72.913088]]

This is how I solved it. Similar to what #Fabio has posted.
new_DB=[]
for i in range(len(DB)):
new_DB.append(tsp_data_unique[(tsp_data_unique['order_id']).isin(DB[i])]
[['latitude','longitude']].values)

How to return a list of numbers from a text file from a varying location

I'm trying to import a list of numbers from a results text file using python. Problem is the location of the list varies depending on the size of the data required. My part of my results file I'm interested in looks like this......
Tip Rotation (degrees)
Node , UR[x] , UR[y] , UR[z]
101 , 0.7978 , 0.7955 , -2.6071
102 , -0.7978 , -0.7955 , -2.6071
1303 , 0.7963 , 0.7693 , -2.6053
1304 , 0.7944 , 0.7150 , -2.5948
1305 , 0.7797 , 0.6616 , -2.5697
1306 , 0.7427 , 0.6074 , -2.5279
1307 , 0.6893 , 0.5509 , -2.4701
1308 , 0.6214 , 0.4914 , -2.3998
1309 , 0.5421 , 0.4272 , -2.3227
1310 , 0.4538 , 0.3585 , -2.2451
1311 , 0.3589 , 0.2848 , -2.1736
1312 , 0.2594 , 0.2070 , -2.1141
1313 , 0.1568 , 0.1256 , -2.0715
So essentially I want to make a list of the values in the UR[z] column, but the location of this block varies depending the on the amount of nodes etc. So I was thinking of say searching for UR[z] then returning the values in its column. Also I would like to do it with out specifying the number of nodes, but can do it if it simplifies it greatly.
Any help would be greatly appreciated! Cheers!
Just as a further note the rest of the results file is as follows!
SUMMARY OF RESULTS
Max tip rotation =,-2.6071,degrees
Min tip rotation =,-2.0493,degrees
Mean tip rotation =,-2.3655,degrees
Max tip displacement =,2.4013,mm
Min tip displacement =,1.0670,mm
Mean tip displacement = ,1.6051,mm
Max Tsai-Wu FC =,0.3904
Max Tsai-Hill FC =,0.3909
PLATE MODEL DATA
Length =,1000.00,mm
Width =,500.00,mm
Pretwist =, 65.00,degrees
Number of mesh elements =, 1250
Number of nodes =, 1326
VAT FIBRE PATH DEFINITION
Fibre path formula =
2
-6.096e-17 x + 1.421e-15 x - 90
Number of partitions = , 50
Partition No , Ply Angle (degrees)
1 ,-90.00
25 ,-90.00
50 ,-90.00
VAT Ply Orientations in each region
Region , Angle (degrees)
1 ,-90.0
2 ,-90.0
3 ,-90.0
4 ,-90.0
5 ,-90.0
6 ,-90.0
7 ,-90.0
8 ,-90.0
9 ,-90.0
10 ,-90.0
11 ,-90.0
12 ,-90.0
13 ,-90.0
14 ,-90.0
15 ,-90.0
16 ,-90.0
17 ,-90.0
18 ,-90.0
19 ,-90.0
20 ,-90.0
21 ,-90.0
22 ,-90.0
23 ,-90.0
24 ,-90.0
25 ,-90.0
26 ,-90.0
27 ,-90.0
28 ,-90.0
29 ,-90.0
30 ,-90.0
31 ,-90.0
32 ,-90.0
33 ,-90.0
34 ,-90.0
35 ,-90.0
36 ,-90.0
37 ,-90.0
38 ,-90.0
39 ,-90.0
40 ,-90.0
41 ,-90.0
42 ,-90.0
43 ,-90.0
44 ,-90.0
45 ,-90.0
46 ,-90.0
47 ,-90.0
48 ,-90.0
49 ,-90.0
50 ,-90.0
ADDITIONAL DATA
Tip Rotation (degrees)
Node , UR[x] , UR[y] , UR[z]
101 , 0.7978 , 0.7955 , -2.6071
102 , -0.7978 , -0.7955 , -2.6071
1303 , 0.7963 , 0.7693 , -2.6053
1304 , 0.7944 , 0.7150 , -2.5948
1305 , 0.7797 , 0.6616 , -2.5697
1306 , 0.7427 , 0.6074 , -2.5279
1307 , 0.6893 , 0.5509 , -2.4701
1308 , 0.6214 , 0.4914 , -2.3998
1309 , 0.5421 , 0.4272 , -2.3227
1310 , 0.4538 , 0.3585 , -2.2451
1311 , 0.3589 , 0.2848 , -2.1736
1312 , 0.2594 , 0.2070 , -2.1141
1313 , 0.1568 , 0.1256 , -2.0715
1314 , 0.0525 , 0.0421 , -2.0493
1315 , -0.0525 , -0.0421 , -2.0493
1316 , -0.1568 , -0.1256 , -2.0715
1317 , -0.2594 , -0.2070 , -2.1141
1318 , -0.3589 , -0.2848 , -2.1736
1319 , -0.4538 , -0.3585 , -2.2451
1320 , -0.5421 , -0.4272 , -2.3227
1321 , -0.6214 , -0.4914 , -2.3998
1322 , -0.6893 , -0.5509 , -2.4701
1323 , -0.7427 , -0.6074 , -2.5279
1324 , -0.7797 , -0.6616 , -2.5697
1325 , -0.7944 , -0.7150 , -2.5948
1326 , -0.7963 , -0.7693 , -2.6053
Tip Displacement (mm)
Node , U[x] , U[y] , U[z]
101 , 9.2757 , -4.6016 , 2.4013
102 , -9.2757 , 4.6016 , 2.4013
1303 , 8.4491 , -4.2173 , 2.2646
1304 , 7.6214 , -3.8331 , 2.1175
1305 , 6.8005 , -3.4481 , 1.9597
1306 , 5.9917 , -3.0625 , 1.8024
1307 , 5.2006 , -2.6792 , 1.6515
1308 , 4.4320 , -2.3003 , 1.5123
1309 , 3.6888 , -1.9278 , 1.3887
1310 , 2.9721 , -1.5627 , 1.2830
1311 , 2.2807 , -1.2054 , 1.1973
1312 , 1.6116 , -0.8552 , 1.1323
1313 , 0.9596 , -0.5107 , 1.0888
1314 , 0.3186 , -0.1698 , 1.0670
1315 , -0.3186 , 0.1698 , 1.0670
1316 , -0.9596 , 0.5107 , 1.0888
1317 , -1.6116 , 0.8552 , 1.1323
1318 , -2.2807 , 1.2054 , 1.1973
1319 , -2.9721 , 1.5627 , 1.2830
1320 , -3.6888 , 1.9278 , 1.3887
1321 , -4.4320 , 2.3003 , 1.5123
1322 , -5.2006 , 2.6792 , 1.6515
1323 , -5.9917 , 3.0625 , 1.8024
1324 , -6.8005 , 3.4481 , 1.9597
1325 , -7.6214 , 3.8331 , 2.1175
1326 , -8.4491 , 4.2173 , 2.2646
END OF RESULTS
I only want to looks at the tip rotation data section!

Without using any modules:
with open('filename') as f:
next(f) #skip first line
nodes = map(str.strip, next(f).split(','))
#find the index of `'UR[z]'` in `nodes` list
column = nodes.index('UR[z]')
#iterate over each line one by one
for line in f:
line = map(str.strip, line.split(','))
#return the desired column
print line[column]
output:
-2.6071
-2.6071
-2.6053
-2.5948
-2.5697
-2.5279
-2.4701
-2.3998
-2.3227
-2.2451
-2.1736
-2.1141
-2.0715
Update:
with open('filename') as f:
column = None
for line in f:
if 'UR[z]' in line:
line = map(str.strip, line.split(','))
column = line.index('UR[z]')
break
if column is not None:
for line in f:
if not line.strip():
break
print map(str.strip, line.split(','))[column]
output:
-2.6071
-2.6071
-2.6053
-2.5948
-2.5697
-2.5279
-2.4701
-2.3998
...
-2.5948
-2.6053

You could do something like this (a big rough on the float parsing though):
import csv
with open('data.txt','r') as fh:
reader = csv.reader(fh)
for row in reader:
try:
print float(row[3])
except: pass

You could use numpy genfromtxt:
import numpy as np
data = np.genfromtxt('yourfile.txt',delimiter=' , ',skip_header=1, names=True)
Then access your column like this
data['URx']

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Show pictures which represent kmeans cluster center (Scikit learn) - python

If you cluster SIFT descriptors, your cluster means will look like sift descriptors, not like images. I believe you were thinking of EigenFaces, but that has little to do with k-means.

Related

Python - Plot linear percentage graph

Padas, how do I create a dataset where the column is a multidimensional array?

Python numpy : Matrix Inverses give unprecise results when multiplied

how to subset pandas dataframe with list

How to return a list of numbers from a text file from a varying location

Categories

Resources