Alright, so I have 3 numpy matrices:
m1 = [[ 3 2 2 ... 2 2 3]
[ 3 2 2 ... 3 3 2]
[500 501 502 ... 625 626 627]
...
[623 624 625 ... 748 749 750]
[624 625 626 ... 749 750 751]
[625 626 627 ... 750 751 752]]
m2 = [[ 3 2 500 ... 623 624 625]
[ 3 2 500 ... 623 624 625]
[ 2 3 500 ... 623 624 625]
...
[ 2 2 500 ... 623 624 625]
[ 2 2 500 ... 623 624 625]
[ 3 2 500 ... 623 624 625]]
m3 = [[ 813 827 160500 ... 199983 200304 200625]
[ 830 843 164000 ... 204344 204672 205000]
[ 181317 185400 36064000 ... 44935744 45007872 45080000]
...
[ 221046 225867 43936000 ... 54744256 54832128 54920000]
[ 221369 226196 44000000 ... 54824000 54912000 55000000]
[ 221692 226525 44064000 ... 54903744 54991872 55080000]]
m1, m2 and m3 are very large square matrices (these examples are 128x128, but they can go up to 2048x2048). Also, m1*m2 = m3.
My goal is to obtain m2 using only m1 and m3. Someone told me this was possible, since m1*m2 = m3 implies that (m1**-1) * m3 = m2 (I believe it was that; please correct me if I'm wrong), so I calculated the inverse of m1:
m1**-1 = [[ 7.70884284e-01 -8.13188394e-01 -1.65131146e+13 ... -2.49697170e+12
-7.70160676e+12 -4.13395320e+13]
[-3.38144598e-01 2.54532610e-01 1.01286404e+13 ... -3.64296085e+11
2.60327813e+12 2.41783491e+13]
[ 1.77721050e-01 -3.54566231e-01 -5.00564604e+12 ... 5.82415184e+10
-5.98354744e+11 -1.29817153e+13]
...
[-6.56772812e-02 1.54498025e-01 3.21826474e+12 ... 2.61432526e+11
1.14203762e+12 3.61036457e+12]
[ 5.82732587e-03 -3.44252762e-02 -4.79430664e+11 ... 5.10855381e+11
-1.07679881e+11 -1.71485373e+12]
[ 6.55360708e-02 -8.24446025e-02 -1.19618881e+12 ... 4.45713678e+11
-3.48073716e+11 -4.89344092e+12]]
The result looked rather messy, so I ran a test and multiplied m1**-1 by m1 to see if it worked:
(m1**-1)*m1 = [[-125.296875 , -117.34375 , -117.390625 , ..., -139.15625 ,
-155.203125 , -147.25 ],
[ 483.1640625 , 483.953125 , 482.7421875 , ..., 603.796875 ,
590.5859375 , 593.375 ],
[-523.22851562, -522.36328125, -523.49804688, ..., -633.07421875,
-635.20898438, -637.34375 ],
...,
[ 10.58691406, 11.68945312, 10.29199219, ..., 14.40429688,
13.00683594, 11.609375 ],
[ -5.32177734, -5.47949219, -4.63720703, ..., -5.28613281,
-5.31884766, -5.6015625 ],
[ -4.93554688, -3.58984375, -3.24414062, ..., -8.72265625,
-5.37695312, -8.03125 ]]
The result is different from the expected one (the identity matrix). My guess is that m1 is too big, causing numerical imprecision. But if that calculation can't even produce a proper identity matrix, then (m1**-1)*m3 surely won't work (and it doesn't).
But I really can't decrease the matrix sizes for m1, m2 and m3, and in fact I'd like this to work with even bigger sizes (as said before, the max size would be 2048x2048).
Would there be any way to be more precise with such calculations? Is there an alternative that could work for bigger matrices?
You are right, inverting a large matrix can be inefficient and numerically unstable. Luckily, there are methods in linear algebra that solve this problem without requiring an inverse.
In this case, m2 = np.linalg.solve(m1, m3) works.
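A minimal sketch of that (the float64 cast and the residual check are my additions, not part of your original code):

import numpy as np

# work in double precision; huge integer products lose accuracy otherwise
m1_f = m1.astype(np.float64)
m3_f = m3.astype(np.float64)

# solves m1 @ x = m3 via LU factorization instead of forming m1**-1
m2_recovered = np.linalg.solve(m1_f, m3_f)

# sanity check: the residual should be tiny relative to m3
print(np.allclose(m1_f @ m2_recovered, m3_f))

Note that if m1 is nearly singular (the huge entries in your computed inverse suggest a very large condition number, which you can check with np.linalg.cond(m1_f)), no float64 method will recover m2 exactly; solve just degrades far more gracefully than explicit inversion.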
I have this numpy array:
[
[ 0 0 0 0 0 0 2 0 2 0 0 1 26 0]
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 4]
[21477 61607 21999 17913 22470 32390 11987 41977 81676 20668 17997 15278 46281 19884]
[ 5059 13248 5498 3866 2144 6161 2361 8734 16914 3724 4614 3607 11305 2880]
[ 282 1580 324 595 218 525 150 942 187 232 430 343 524 189]
[ 1317 6416 1559 882 599 2520 525 2560 19197 729 1391 1727 2044 1198]
]
I've just created a logarithmic heatmap, which works as intended. However, I would like to create another heatmap that uses a linear scale across rows and shows the percentage value for each position in the matrix, so that each row sums to 100%. Without using seaborn or pandas.
Here you go:
import matplotlib.pyplot as plt
import numpy as np
a = np.array([[0,0,0,0,0,0,2,0,2,0,0,1,26,0],
[0,0,0,0,0,0,0,0,0,0,0,0,0,4],
[21477,61607,21999,17913,22470,32390,11987,41977,81676,20668,17997,15278,46281,19884],
[5059,13248,5498,3866,2144,6161,2361,8734,16914,3724,4614,3607,11305,2880],
[282,1580,324,595,218,525,150,942,187,232,430,343,524,189],
[1317,6416,1559,882,599,2520,525,2560,19197,729,1391,1727,2044,1198]])
# normalize each row so it sums to 1
normalized_a = a / np.sum(a, axis=1)[:, None]
# plot
plt.imshow(normalized_a)
plt.show()
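If you want the cells to read as percentages rather than fractions, you can scale by 100 and annotate each cell; a sketch (the colorbar and the annotation loop are my additions):

percent = normalized_a * 100  # each row now sums to 100

fig, ax = plt.subplots()
im = ax.imshow(percent)
fig.colorbar(im, label='% of row total')
# write the percentage into each cell
for i in range(percent.shape[0]):
    for j in range(percent.shape[1]):
        ax.text(j, i, f'{percent[i, j]:.1f}', ha='center', va='center', fontsize=6, color='w')
plt.show()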
This is what my data looks like
print(data)
array([[ 0.369 , -0.3396 , 0.1017 , ..., 0.2164 , -0.11163, -0.6025 ],
[ 0.548 , -0.2668 , -0.1425 , ..., -0.3198 , -0.599 , 0.04703],
[ 0.761 , -0.2515 , 0.02998, ..., 0.04663, -0.3276 , -0.1771 ],
...,
[ 0.2148 , -0.492 , -0.03586, ..., 0.1157 , -0.299 , -0.12 ],
[ 0.775 , -0.2622 , -0.1372 , ..., 0.356 , -0.2673 , -0.1897 ],
[ 0.775 , -0.2622 , -0.1372 , ..., 0.356 , -0.2673 , -0.1897 ]],
dtype=float16)
I am trying to convert this to a column in pandas using this:
dataset = pd.DataFrame(data, index=[0])
print(dataset)
But I get this error
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py in create_block_manager_from_blocks(blocks, axes)
1652
-> 1653 mgr = BlockManager(blocks, axes)
1654 mgr._consolidate_inplace()
ValueError: Shape of passed values is (267900, 768), indices imply (1, 768)
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py in construction_error(tot_items, block_shape, axes, e)
1689 raise ValueError("Empty data passed with indices specified.")
1690 raise ValueError("Shape of passed values is {0}, indices imply {1}".format(
-> 1691 passed, implied))
1692
1693
ValueError: Shape of passed values is (267900, 768), indices imply (1, 768)
It looks like the tricky part is having a whole array as a row entry.
There was a suggestion to remove the index: dataset = pd.DataFrame(data). However, this does not give the desired result. Here's what the result looks like:
dataset = pd.DataFrame(embeds16[:,0])
dataset.head()
0 1 2 3 4 5 6 7 8 9 ... 758 759 760 761 762 763 764 765 766 767
0 0.368896 -0.339600 0.101685 0.679199 -0.201904 -0.247192 -0.032776 -0.057098 0.287354 -0.356689 ... 0.064453 0.548340 -0.047729 -0.615723 -0.225464 -0.071106 -0.254395 0.216431 -0.111633 -0.602539
1 0.547852 -0.266846 -0.142456 1.327148 -0.135254 -0.376953 -0.221069 -0.273926 -0.099609 -0.146118 ... 0.138184 0.446777 -0.577637 0.051300 0.187378 0.171021 0.079163 -0.319824 -0.599121 0.047028
2 0.761230 -0.251465 0.029984 1.008789 -0.311279 -0.419922 -0.015869 -0.019196 0.016174 -0.284424 ... 0.152100 0.452881 -0.265381 -0.272949 0.029831 0.002472 0.186646 0.046631 -0.327637 -0.177124
3 0.690918 -0.374756 -0.008820 0.869141 -0.496582 -0.546875 0.060028 0.139893 -0.032471 -0.120361 ... 0.040314 0.391113 -0.420898 -0.342285 0.191650 0.350830 0.083130 0.028137 -0.488525 -0.157349
4 0.583008 -0.342529 -0.073608 0.683105 -0.071777 -0.390137 -0.174316 0.154541 0.170410 -0.184692 ... 0.326416 0.450928 0.083923 -0.331299 -0.207520
I am looking to have the entire array in a single column, not spread over multiple columns.
Do you mean:
pd.Series(a.tolist())
Update: this keeps each row as a numpy array instead of converting it to a Python list:
pd.Series([x for x in a])
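Either way you end up with a Series whose cells each hold a whole vector; a minimal sketch of building a one-column DataFrame from it (the column name 'embedding' and the random stand-in data are my inventions):

import numpy as np
import pandas as pd

data = np.random.rand(5, 768).astype(np.float16)  # stand-in for your (267900, 768) array

# list(data) yields one 1-D array per row, so each cell holds a whole vector
dataset = pd.DataFrame({'embedding': list(data)})
print(dataset.head())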
My data file is tab-separated and looks like this:
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
244 51 2 880606923
166 346 1 886397596
298 474 4 884182806
115 265 2 881171488
253 465 5 891628467
305 451 3 886324817
... ... .. .........
I imported them in Python using numpy, here is my script:
from numpy import loadtxt
np_data = loadtxt('u.data', delimiter='\t', skiprows=0)
print(np_data)
I just want to print it to see the result, but it gives me a different format:
[[ 1.96000000e+02 2.42000000e+02 3.00000000e+00 8.81250949e+08]
[ 1.86000000e+02 3.02000000e+02 3.00000000e+00 8.91717742e+08]
[ 2.20000000e+01 3.77000000e+02 1.00000000e+00 8.78887116e+08]
...,
[ 2.76000000e+02 1.09000000e+03 1.00000000e+00 8.74795795e+08]
[ 1.30000000e+01 2.25000000e+02 2.00000000e+00 8.82399156e+08]
[ 1.20000000e+01 2.03000000e+02 3.00000000e+00 8.79959583e+08]]
There is a point . in every number in print(np_data). How do I format them to look like my original data file?
I've solved this; it turns out I missed the dtype argument, so the script should look like this:
from numpy import loadtxt
np_data = loadtxt('u.data',dtype=int ,delimiter='\t', skiprows=0)
print(np_data)
and done
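As a side note, if some of your columns genuinely need to stay floats, you can keep the default dtype and just suppress scientific notation when printing; a sketch of that alternative:

from numpy import loadtxt, set_printoptions

np_data = loadtxt('u.data', delimiter='\t')
set_printoptions(suppress=True)  # plain decimal notation instead of 8.81e+08
print(np_data)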
I have a pandas dataframe like this.
order_id latitude longitude
0 519 19.119677 72.905081
1 520 19.138250 72.913190
2 521 19.138245 72.913183
3 523 19.117662 72.905484
4 524 19.137793 72.913088
5 525 19.119372 72.893768
6 526 19.116275 72.892951
7 527 19.133430 72.913268
8 528 19.136800 72.917185
9 529 19.118284 72.901114
10 530 19.127193 72.914269
11 531 19.114269 72.904039
12 532 19.136292 72.913941
13 533 19.119075 72.895115
14 534 19.119677 72.905081
15 535 19.119677 72.905081
And one list
DB
Out[658]:
[['523'],
['526', '533'],
['527', '528', '532', '535'],
['530', '519'],
['529', '531', '525', '534'],
['520', '521', '524']]
Now I want to subset the dataframe on the list elements. There are 6 elements in the list, and every element has a sublist of order_ids. So, for every sub-element I want the corresponding latitude and longitude. Then I want to calculate the haversine distance between each order_id location:
DB[2]
['527', '528', '532', '535']
Then I want to subset the main dataframe for the latitude and longitude pairs. So it should return me an array like this:
array([[ 19.11824057, 72.8939447 ],
[ 19.1355074 , 72.9147978 ],
[ 19.11917348, 72.90518167],
[ 19.127193 , 72.914269 ]])
(Just an example, not the correct lat/long pairs.)
I am doing following:
db_lat = []
db_long = []
for i in range(len(DB)):
    l = len(DB[i])
    for j in range(l):
        db_lat.append(tsp_data_unique.latitude[tsp_data_unique['order_id'] == ''.join(DB[i][j])])
        db_long.append(tsp_data_unique.longitude[tsp_data_unique['order_id'] == ''.join(DB[i][j])])
But it gives me a flat list of all the latitudes and longitudes present in DB, and I am not able to distinguish which lat and long belong to which DB element. So, for every DB element (6 in my case) I want 6 arrays of lat and long. Please help.
First of all, I would convert your int column to str so you can compare the dataframe with the values in the list:
df['order_id'] = df['order_id'].apply(str)
and then set the index on order_id:
df = df.set_index('order_id')
Then you can do something like:
pairs = df.loc[DB[2]].values
obtaining:
array([[ 19.13343 , 72.913268],
[ 19.1368 , 72.917185],
[ 19.136292, 72.913941],
[ 19.119677, 72.905081]])
EDIT:
Iterating over your list you can then:
In [93]: for i in range(len(DB)):
    ....:     p = df.loc[DB[i]].values
    ....:     print(p)
    ....:
[[ 19.117662 72.905484]]
[[ 19.116275 72.892951]
[ 19.119075 72.895115]]
[[ 19.13343 72.913268]
[ 19.1368 72.917185]
[ 19.136292 72.913941]
[ 19.119677 72.905081]]
[[ 19.127193 72.914269]
[ 19.119677 72.905081]]
[[ 19.118284 72.901114]
[ 19.114269 72.904039]
[ 19.119372 72.893768]
[ 19.119677 72.905081]]
[[ 19.13825 72.91319 ]
[ 19.138245 72.913183]
[ 19.137793 72.913088]]
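And since the end goal is the haversine distance between the locations in each group, here is a minimal NumPy sketch of the pairwise computation (my addition; it assumes lat/long are in degrees and uses a 6371 km Earth radius):

import numpy as np

def haversine(p1, p2, r=6371.0):
    # great-circle distance in km between two (lat, lon) points given in degrees
    lat1, lon1, lat2, lon2 = map(np.radians, (p1[0], p1[1], p2[0], p2[1]))
    a = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    return 2 * r * np.arcsin(np.sqrt(a))

pairs = df.loc[DB[2]].values
# distance between every pair of orders in the group
for i in range(len(pairs)):
    for j in range(i + 1, len(pairs)):
        print(DB[2][i], DB[2][j], haversine(pairs[i], pairs[j]))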
This is how I solved it. It is similar to what @Fabio posted.
new_DB = []
for i in range(len(DB)):
    new_DB.append(tsp_data_unique[tsp_data_unique['order_id'].isin(DB[i])]
                  [['latitude', 'longitude']].values)
I have a dataset which was clustered by kmeans. A friend told me that I can show the pictures which represent each cluster center. He gave me this short example code:
for i in xrange(len(np.unique(labels))):
this_cluster = np.where(labels == i)[0]
fig, ax = plt.subplots(len(this_cluster))
for im in this_cluster:
ax.imshow(images[im])
I've tried this but it's not working... For example, I have a small dataset which contains 20 pics. Kmeans returns 50 centers for these 20 pics, so my np.unique(labels) (with labels = kmeans.labels_?!) is equal to 50, and "i" runs from 0 to 49... my first "this_cluster" looks like this one:
[ 4 8 18 19 35 37 50 135 140 146 156 214 371 506 563
586 594 887 916 989 993 1021 1061 1105 1121 1128 1405 1409 1458 1466
1481 1484 1505 1572 1573 1620 1784 1817 1835 1854 1945 1955 2004 2006 2054
2135 2204 2245 2319 2321 2343 2391 2410 2414 2486 2502 2530 2594 2624 2629
2825 2828 2833 2911 3017 3097 3245 3246 3298 3347 3493 3568 3627 3677 3701
3789 3866 3941 3944 3969 4022 4115 4214 4215 4432 4527 4559 4594 4645 4668
4699 4785 4797 4802 4807 4831 4892 4905 4921 4929 4932 5076 5178 5233 5249
5318 5463 5508 5571 5621 5644 5661 5678 5690 5727 5736 5737 5755 5777 5961
6088 6089 6107 6197 6353 6487 6500 6515 6565 6575 6601 6706 6749]
so when the next for begins, it breaks at i=4 because there are only 20 pictures, and images[im] with im>20 gives me an out of bounds... I think the "this_cluster" values are the indices of the descriptors taken from the dataset, which were computed by kmeans and assigned to cluster 0... so this can't be right?! Or am I on the wrong track? Maybe someone could help me.
EDIT:
# create sets
X_train_pos, X_test_pos, X_dataset_train_pos, X_dataset_test_pos = train_test_split(X_desc_pos, dataset_pos, test_size=0.5)
X_train_neg, X_test_neg, X_dataset_train_neg, X_dataset_test_neg = train_test_split(X_desc_neg, dataset_neg, test_size=0.5)
# merge list of array descriptor into descriptor list
x1 = numpy.vstack(X_train_pos)
x2 = numpy.vstack(X_train_neg)
# compute cluster centers
kmeans, n_clusters = dataset_module.create_center_data(numpy.vstack((x1,x2)),numpy.vstack((X_dataset_train_pos,X_dataset_train_neg)))
# compute kmeans
def create_center_data(data,dataset):
n_clusters = len(data)
n_clusters = math.sqrt(n_clusters/2)
n_clusters = int(n_clusters)
kmeans = KMeans(init='k-means++', n_clusters=n_clusters, n_init=1)
kmeans.fit(data)
numpy.set_printoptions(threshold=numpy.nan)
labels = kmeans.labels_
for i in xrange(len(numpy.unique(labels))):
this_cluster = numpy.where(labels == i)[0]
fig, ax = plt.subplots(len(this_cluster))
for im in this_cluster:
pic = open(dataset[im], "rb")
ax.imshow(pic)
return kmeans, n_clusters
data looks like:
[[ 36. 1. 9. ..., 0. 0. 0.]
[ 0. 0. 1. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 1.]
...,
[ 49. 26. 0. ..., 12. 4. 5.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 3. 8. ..., 0. 0. 3.]]
data = all the descriptors of the 20 pictures...
dataset is a numpy array with paths to the pictures.
Regards,
Linda
If you cluster SIFT descriptors, your cluster means will look like SIFT descriptors, not like images.
I believe you were thinking of Eigenfaces, but that has little to do with k-means.
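If what you actually want is one representative picture per cluster, a common workaround is to record which image each descriptor came from and then show, for every cluster, the image contributing the most descriptors to it. A sketch, assuming a per_image_descriptors list (one descriptor array per picture, my naming) that you stack before calling fit:

import numpy as np

# image_ids[k] = index of the picture that descriptor k was extracted from
# (build it while stacking the per-image descriptor arrays)
image_ids = np.concatenate([np.full(len(d), i) for i, d in enumerate(per_image_descriptors)])

labels = kmeans.labels_
for c in np.unique(labels):
    members = image_ids[labels == c]  # source image of each descriptor in cluster c
    counts = np.bincount(members, minlength=len(per_image_descriptors))
    print("cluster %d: show picture %d" % (c, counts.argmax()))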