Change format for data imported from file in Python

My data file is Tab separated and looks like this:
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
244 51 2 880606923
166 346 1 886397596
298 474 4 884182806
115 265 2 881171488
253 465 5 891628467
305 451 3 886324817
... ... .. .........
I imported the data in Python using numpy; here is my script:
from numpy import loadtxt
np_data = loadtxt('u.data', delimiter='\t', skiprows=0)
print(np_data)
I just want to print it to see the result, but it prints in a different format:
[[ 1.96000000e+02 2.42000000e+02 3.00000000e+00 8.81250949e+08]
[ 1.86000000e+02 3.02000000e+02 3.00000000e+00 8.91717742e+08]
[ 2.20000000e+01 3.77000000e+02 1.00000000e+00 8.78887116e+08]
...,
[ 2.76000000e+02 1.09000000e+03 1.00000000e+00 8.74795795e+08]
[ 1.30000000e+01 2.25000000e+02 2.00000000e+00 8.82399156e+08]
[ 1.20000000e+01 2.03000000e+02 3.00000000e+00 8.79959583e+08]]
Every number in print(np_data) is shown with a decimal point and scientific notation. How can I format them to look like my original data file?

I've solved this; it turns out I missed the dtype argument, so the script should look like this:
from numpy import loadtxt
np_data = loadtxt('u.data', dtype=int, delimiter='\t', skiprows=0)
print(np_data)
and done
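If you'd rather keep the values as floats, numpy's print options can also suppress the scientific notation instead; a minimal sketch with made-up rows shaped like the file:

```python
import numpy as np

# Rows shaped like the tab-separated file: user id, item id, rating, timestamp
np_data = np.array([[196, 242, 3, 881250949],
                    [186, 302, 3, 891717742]], dtype=float)

# Suppress scientific notation and print the floats without decimal points
np.set_printoptions(suppress=True,
                    formatter={'float_kind': lambda x: f'{x:.0f}'})
print(np_data)
```

Note this only changes how the array is displayed; the underlying dtype is still float64, whereas dtype=int changes the stored values themselves.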

Related

train_test_split converts a string-type label to np.array. Is there any way to get back the original label name?

I have an image dataset with a string-type label name. When I split the data using train_test_split from the sklearn library, it converts the label to np.array type. Is there a way to get back the original string-type label name?
The below code splits a data to train and test:
imgs, y = load_images()
train_img,ytrain_img,test_img,ytest_img = train_test_split(imgs,y, test_size=0.2, random_state=1)
If I print y, it gives me the label name, but if I print the split label value it gives an array:
for k in y:
    print(k)
    break
for k in ytrain_img:
    print(k)
    break
Output:
001.Affenpinscher
[[[ 97 180 165]
[ 93 174 159]
[ 91 169 152]
...
[[ 88 171 156]
[ 88 170 152]
[ 84 162 145]
...
[130 209 222]
[142 220 233]
[152 230 243]]
[[ 99 181 163]
[ 98 178 161]
[ 92 167 151]
...
[130 212 224]
[137 216 229]
[143 222 235]]
...
[[ 85 147 158]
[ 85 147 158]
[111 173 184]
...
[227 237 244]
[236 248 250]
[234 248 247]]
[[ 94 154 166]
[ 96 156 168]
[133 194 204]
...
[226 238 244]
[237 249 253]
[237 252 254]]
...
[228 240 246]
[238 252 255]
[241 255 255]]]
Is there a way to convert back the array to the original label name?
No, you are interpreting the output of train_test_split incorrectly.
train_test_split works in this way:
A_train, A_test, B_train, B_test, C_train, C_test ...
= train_test_split(A, B, C ..., test_size=0.2)
You can pass as many arrays as you want to split. For each array, it returns the train split and then the test split, then does the same for the next array, and so on.
So in your case actually it is:
train_img, test_img, ytrain_img, ytest_img = train_test_split(imgs, y,
                                                              test_size=0.2,
                                                              random_state=1)
But with your original unpacking order the names are mixed up: what you called ytrain_img actually holds the test split of the images, and test_img holds the training labels. That is why printing ytrain_img shows an image array instead of a label.
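A minimal sketch of this unpacking order, with toy arrays standing in for the image data (the names and values here are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for imgs and y: ten "images" with paired labels
imgs = np.arange(10)
y = imgs * 10  # label i pairs with image i

# Unpacking order: both splits of the first array, then both splits of the second
train_img, test_img, ytrain_img, ytest_img = train_test_split(
    imgs, y, test_size=0.2, random_state=1)

print(len(train_img), len(test_img))  # 8 2
# The pairing between images and labels is preserved within each split
print(all(b == a * 10 for a, b in zip(train_img, ytrain_img)))  # True
```

Because the shuffling is applied jointly to all arrays, each label still corresponds to its image once the outputs are unpacked in the right order.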

Python numpy: matrix inverses give imprecise results when multiplied

Alright, so I have 3 numpy matrices:
m1 = [[ 3 2 2 ... 2 2 3]
[ 3 2 2 ... 3 3 2]
[500 501 502 ... 625 626 627]
...
[623 624 625 ... 748 749 750]
[624 625 626 ... 749 750 751]
[625 626 627 ... 750 751 752]]
m2 = [[ 3 2 500 ... 623 624 625]
[ 3 2 500 ... 623 624 625]
[ 2 3 500 ... 623 624 625]
...
[ 2 2 500 ... 623 624 625]
[ 2 2 500 ... 623 624 625]
[ 3 2 500 ... 623 624 625]]
m3 = [[ 813 827 160500 ... 199983 200304 200625]
[ 830 843 164000 ... 204344 204672 205000]
[ 181317 185400 36064000 ... 44935744 45007872 45080000]
...
[ 221046 225867 43936000 ... 54744256 54832128 54920000]
[ 221369 226196 44000000 ... 54824000 54912000 55000000]
[ 221692 226525 44064000 ... 54903744 54991872 55080000]]
m1, m2 and m3 are very large square matrices (those examples are 128x128, but they can go up to 2048x2048). Also, m1*m2=m3.
My goal is to obtain m2 using only m1 and m3. Someone told me this was possible, since m1*m2=m3 implies that (m1**-1)*m3 = m2 (I believe it was that; please correct me if I'm wrong), so I calculated the inverse of m1:
m1**-1 = [[ 7.70884284e-01 -8.13188394e-01 -1.65131146e+13 ... -2.49697170e+12
-7.70160676e+12 -4.13395320e+13]
[-3.38144598e-01 2.54532610e-01 1.01286404e+13 ... -3.64296085e+11
2.60327813e+12 2.41783491e+13]
[ 1.77721050e-01 -3.54566231e-01 -5.00564604e+12 ... 5.82415184e+10
-5.98354744e+11 -1.29817153e+13]
...
[-6.56772812e-02 1.54498025e-01 3.21826474e+12 ... 2.61432526e+11
1.14203762e+12 3.61036457e+12]
[ 5.82732587e-03 -3.44252762e-02 -4.79430664e+11 ... 5.10855381e+11
-1.07679881e+11 -1.71485373e+12]
[ 6.55360708e-02 -8.24446025e-02 -1.19618881e+12 ... 4.45713678e+11
-3.48073716e+11 -4.89344092e+12]]
The result looked rather messy, so I ran a test and multiplied m1**-1 by m1 to see if it worked:
(m1**-1)*m1 = [[-125.296875 , -117.34375 , -117.390625 , ..., -139.15625 ,
-155.203125 , -147.25 ],
[ 483.1640625 , 483.953125 , 482.7421875 , ..., 603.796875 ,
590.5859375 , 593.375 ],
[-523.22851562, -522.36328125, -523.49804688, ..., -633.07421875,
-635.20898438, -637.34375 ],
...,
[ 10.58691406, 11.68945312, 10.29199219, ..., 14.40429688,
13.00683594, 11.609375 ],
[ -5.32177734, -5.47949219, -4.63720703, ..., -5.28613281,
-5.31884766, -5.6015625 ],
[ -4.93554688, -3.58984375, -3.24414062, ..., -8.72265625,
-5.37695312, -8.03125 ]]
The result is different from the expected identity matrix. My guess is that m1 is too big, causing numerical imprecision. But if that calculation doesn't even produce an identity matrix, then (m1**-1)*m3 surely won't work either (and it doesn't).
But I really can't decrease the sizes of m1, m2 and m3; in fact, I'd like this to work with even bigger sizes (as said before, up to 2048x2048).
Would there be any way to make such calculations more precise? Is there an alternative that could work for bigger matrices?
You are right, inverting a large matrix can be inefficient and numerically unstable. Luckily, there are methods in linear algebra that solve this problem without requiring an inverse.
In this case, m2 = np.linalg.solve(m1, m3) works.
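A small sketch of the idea on a well-conditioned toy system (these are not the matrices from the question):

```python
import numpy as np

rng = np.random.default_rng(0)
m1 = rng.random((4, 4)) + 4 * np.eye(4)  # diagonally dominant, so well conditioned
m2 = rng.random((4, 4))
m3 = m1 @ m2

# Solve m1 @ x = m3 directly instead of forming the inverse
x = np.linalg.solve(m1, m3)
print(np.allclose(x, m2))  # True
```

solve uses an LU factorization under the hood, which is both faster and more accurate than computing the explicit inverse and multiplying. Note that if m1 itself is ill-conditioned (as large structured matrices like those in the question often are), even solve will lose precision; the condition number, via np.linalg.cond(m1), tells you roughly how many digits to expect to lose.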

storing data in numpy array without overwriting previous data

I am running a for loop to store the data in a numpy array. The problem is that after every iteration the previous data is overwritten by the latest one. I want to be able to store all the data by using some "extend" function, as used for simple arrays. I tried append, but it's not storing all the values of all the arrays.
Code:
data1 = np.empty((3, 3), dtype=np.int8)
top_model_weights_path = '/home/ethnicity.071217.23-0.28.hdf5'
df = pd.read_csv('/home/instaurls.csv')
for row in df.itertuples():
    data = io.imread(row[1])
    data1 = np.append(data1, data)
print(data1)
Expected output:
[[[ 34 34 34]
[ 35 35 35]
[ 40 40 40]
...,
[ 8 8 8]
[ 12 12 12]
[ 12 12 12]]
[[ 39 39 39]
[ 30 30 30]
[ 25 25 25]
...,
[ 11 11 11]
[ 1 1 1]
[ 5 5 5]]
[[ 54 54 54]
[ 44 44 44]
[ 34 34 34]
...,
[ 32 32 32]
[ 9 9 9]
[ 0 0 0]]
...,
[[212 212 210]
[167 167 165]
[118 118 116]
...,
[185 186 181]
[176 177 172]
[170 171 166]]
[[220 220 218]
[165 165 163]
[116 116 114]
...,
[158 159 154]
[156 157 152]
[170 171 166]]
[[220 220 218]
[154 154 152]
[106 106 104]
...,
[144 145 140]
[136 137 132]
[158 159 154]]]
top_model_weights_path = '/home/ethnicity.071217.23-0.28.hdf5'
df = pd.read_csv('/home/instaurls.csv')
data1 = np.array([io.imread(row[1]) for row in df.itertuples()])
If your dataset is not too big, there is no problem with building a standard list first and then converting it to a numpy array.
If you're not familiar with list comprehensions, the explicit loop version is:
data1 = []
for row in df.itertuples():
    data1.append(io.imread(row[1]))
data1 = np.array(data1)
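A self-contained sketch of this collect-then-stack pattern, with small same-shaped arrays standing in for the images that io.imread would return:

```python
import numpy as np

# Stand-in for reading images: three 2x2 "images" filled with 0, 1, 2
fake_images = [np.full((2, 2), i) for i in range(3)]

collected = []
for img in fake_images:
    collected.append(img)  # list append keeps each array intact and separate

data1 = np.array(collected)  # stacks them into one array of shape (3, 2, 2)
print(data1.shape)  # (3, 2, 2)
```

This avoids the pitfall in the question: np.append flattens its inputs and copies the whole accumulated array on every call, whereas appending to a list is cheap and one final np.array (or np.stack) preserves each image's shape.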

Why python cv2.resize function on RGB image gives different results than MATLAB imresize function? [duplicate]

This question already has answers here:
How to use Matlab's imresize in python
(3 answers)
Closed 5 years ago.
I'm translating a Python script into a MATLAB one. One step in the Python script resizes a 256*256 RGB image to a 40*40 RGB image using the cv2.resize function, i.e.
import cv2
img = cv2.imread('0.png')
img_40 = cv2.resize(img, (40, 40)) # img rescaled to 40*40
Then I print some pixel values of the B channel of the 40*40 image.
print img_40[0:10, 0:10, 0]
ans =
[[ 0 0 1 2 1 3 0 21 96 128]
[ 2 0 17 13 5 20 15 48 112 126]
[ 0 0 6 0 2 3 80 107 122 129]
[ 0 5 1 7 0 14 98 132 129 127]
[ 1 2 0 0 0 16 100 151 138 134]
[ 0 2 0 2 0 34 105 138 143 139]
[ 0 3 0 0 0 54 96 29 51 79]
[ 5 0 0 0 0 56 118 103 97 38]
[ 3 0 0 0 2 44 132 95 93 89]
[ 1 0 1 3 0 38 141 128 104 26]]
However, when I use MATLAB imresize function, I got a slightly different result. PS: I have set the AntiAliasing to false as mentioned here.
img = imread('0.png');
img_40 = imresize(img,[40,40],'bilinear','AntiAliasing',false);
img_40(1:10,1:10,3)
ans =
0 0 2 1 2 4 0 21 96 128
2 0 18 13 5 20 15 48 112 127
0 0 6 0 3 3 81 107 123 129
0 5 1 7 0 14 99 133 129 127
1 2 0 0 0 16 100 151 139 134
0 2 0 2 0 34 105 139 144 140
0 3 0 0 0 54 96 29 51 79
6 0 0 0 0 57 119 104 97 39
3 0 0 0 2 44 132 96 93 89
1 0 1 3 1 38 141 129 104 26
The test image 0.png is provided.
Looking forward to any explanation that will help me.
Update, October 12th, 2017
As noted by Ander Biguri in his comments below, the reason for the problem may be that the interpolation gives float values while I am working with uint8 values, which may cause rounding error.
After converting the image matrix to a floating-point type, the results match:
With python cv2:
img = cv2.imread('0.png')
d_img = img.astype('float')
img_40 = cv2.resize(d_img, (40, 40)) # img rescaled to 40*40
print img_40[0:10, 0:10, 0]
ans =
[[ 3.00000000e-01 0.00000000e+00 1.40000000e+00 1.81000000e+00
1.61000000e+00 3.49000000e+00 0.00000000e+00 2.13500000e+01
9.62500000e+01 1.28080000e+02]
[ 2.00000000e+00 0.00000000e+00 1.75000000e+01 1.31900000e+01
4.87000000e+00 1.98900000e+01 1.52300000e+01 4.81500000e+01
1.12420000e+02 1.26630000e+02]
[ 0.00000000e+00 0.00000000e+00 5.75000000e+00 0.00000000e+00
2.35000000e+00 2.80000000e+00 8.04000000e+01 1.06500000e+02
1.22300000e+02 1.28800000e+02]
[ 0.00000000e+00 5.28000000e+00 9.50000000e-01 7.19000000e+00
0.00000000e+00 1.40300000e+01 9.85100000e+01 1.32200000e+02
1.28990000e+02 1.27270000e+02]
[ 7.90000000e-01 2.40000000e+00 1.50000000e-01 7.00000000e-02
0.00000000e+00 1.65100000e+01 9.96300000e+01 1.50750000e+02
1.38440000e+02 1.34300000e+02]
[ 0.00000000e+00 2.13000000e+00 1.50000000e-01 1.90000000e+00
0.00000000e+00 3.42900000e+01 1.04730000e+02 1.38400000e+02
1.43560000e+02 1.39400000e+02]
[ 0.00000000e+00 2.70000000e+00 0.00000000e+00 0.00000000e+00
0.00000000e+00 5.43700000e+01 9.57400000e+01 2.88500000e+01
5.05500000e+01 7.88600000e+01]
[ 5.55000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
1.50000000e-01 5.65500000e+01 1.18500000e+02 1.03000000e+02
9.73000000e+01 3.81000000e+01]
[ 3.30000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
1.84000000e+00 4.40700000e+01 1.32280000e+02 9.53000000e+01
9.27000000e+01 8.92400000e+01]
[ 1.40000000e+00 0.00000000e+00 1.00000000e+00 3.03000000e+00
7.00000000e-01 3.79900000e+01 1.40600000e+02 1.28600000e+02
1.04270000e+02 2.61200000e+01]]
With MATLAB:
img = imread('0.png');
d_img = single(img);
img_40 = imresize(d_img,[40,40],'bilinear','AntiAliasing',false);
img_40(1:10,1:10,3)
0.3000 0 1.4000 1.8100 1.6100 3.4900 0 21.3500 96.2500 128.0800
2.0000 0 17.5000 13.1900 4.8700 19.8900 15.2300 48.1500 112.4200 126.6300
0 0 5.7500 0 2.3500 2.8000 80.4000 106.5000 122.3000 128.8000
0 5.2800 0.9500 7.1900 0 14.0300 98.5100 132.2000 128.9900 127.2700
0.7900 2.4000 0.1500 0.0700 0 16.5100 99.6300 150.7500 138.4400 134.3000
0 2.1300 0.1500 1.9000 0 34.2900 104.7300 138.4000 143.5600 139.4000
0 2.7000 0 0 0 54.3700 95.7400 28.8500 50.5500 78.8600
5.5500 0 0 0 0.1500 56.5500 118.5000 103.0000 97.3000 38.1000
3.3000 0 0 0 1.8400 44.0700 132.2800 95.3000 92.7000 89.2400
1.4000 0 1.0000 3.0300 0.7000 37.9900 140.6000 128.6000 104.2700 26.1200
PS: the answers to the question How to use Matlab's imresize in python do not mention the rounding error caused by interpolation when working with uint8 values.
This is due to rounding error. Note that all the differences are of exactly 1 unit.
You are working with uint8 values, but the interpolation almost always gives float values. The midpoint between the pixel values 0 and 3 is 1.5. Depending on the exact order of the math operations inside resize, the result may be 1.4999999999999 or 1.50000000000001, and when rounding you would then get 1 or 2.
Read more about floating-point math.
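A small sketch of why values straddling .5 matter when converting back to integers (pure numpy, independent of cv2 or MATLAB; the two values are made up to represent the same mathematical 1.5 perturbed by rounding error in either direction):

```python
import numpy as np

# Two interpolated values that are "1.5" up to floating-point error
x = np.array([1.4999999, 1.5000001])

print(x.astype(np.uint8))            # [1 1]  -- a plain cast truncates toward zero
print(np.round(x).astype(np.uint8))  # [1 2]  -- rounding first splits them apart
```

So two implementations that compute the same interpolation in a different order can legitimately disagree by exactly 1 unit on such pixels, which matches the pattern seen in the uint8 outputs above.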

how to subset pandas dataframe with list

I have a pandas dataframe like this.
order_id latitude longitude
0 519 19.119677 72.905081
1 520 19.138250 72.913190
2 521 19.138245 72.913183
3 523 19.117662 72.905484
4 524 19.137793 72.913088
5 525 19.119372 72.893768
6 526 19.116275 72.892951
7 527 19.133430 72.913268
8 528 19.136800 72.917185
9 529 19.118284 72.901114
10 530 19.127193 72.914269
11 531 19.114269 72.904039
12 532 19.136292 72.913941
13 533 19.119075 72.895115
14 534 19.119677 72.905081
15 535 19.119677 72.905081
And one list
DB
Out[658]:
[['523'],
['526', '533'],
['527', '528', '532', '535'],
['530', '519'],
['529', '531', '525', '534'],
['520', '521', '524']]
Now I want to subset the dataframe on the list elements. There are 6 elements in the list, and every element is a sublist of order_ids. For every sub-element I want the corresponding latitude and longitude, and then I want to calculate the haversine distance between each order_id location:
DB[2]
['527', '528', '532', '535']
Then I want to subset the main dataframe for the latitude and longitude pairs, so it should return an array like this:
array([[ 19.11824057, 72.8939447 ],
[ 19.1355074 , 72.9147978 ],
[ 19.11917348, 72.90518167],
[ 19.127193 , 72.914269 ]])
(Just an example, not the correct lat/long pairs.)
I am doing the following:
db_lat = []
db_long = []
for i in range(len(DB)):
    l = len(DB[i])
    for j in range(l):
        db_lat.append(tsp_data_unique.latitude[tsp_data_unique['order_id'] ==
                                               ''.join(DB[i][j])])
        db_long.append(tsp_data_unique.longitude[tsp_data_unique['order_id'] ==
                                                 ''.join(DB[i][j])])
But it gives me one flat list of all the lat and long values in DB, so I cannot tell which lat and long belong to which DB element. For every DB element (6 in my case) I want a separate array of lat and long. Please help.
First of all, I would convert your int column to str so the dataframe can be compared with the values in the list:
df['order_id'] = df['order_id'].apply(str)
and then set the index on order_id:
df = df.set_index('order_id')
Then you can do something like:
pairs = df.loc[DB[2]].values
obtaining:
array([[ 19.13343 , 72.913268],
[ 19.1368 , 72.917185],
[ 19.136292, 72.913941],
[ 19.119677, 72.905081]])
EDIT:
Iterating over your list you can then:
In [93]: for i in range(len(DB)):
....: p = df.loc[DB[i]].values
....: print p
....:
[[ 19.117662 72.905484]]
[[ 19.116275 72.892951]
[ 19.119075 72.895115]]
[[ 19.13343 72.913268]
[ 19.1368 72.917185]
[ 19.136292 72.913941]
[ 19.119677 72.905081]]
[[ 19.127193 72.914269]
[ 19.119677 72.905081]]
[[ 19.118284 72.901114]
[ 19.114269 72.904039]
[ 19.119372 72.893768]
[ 19.119677 72.905081]]
[[ 19.13825 72.91319 ]
[ 19.138245 72.913183]
[ 19.137793 72.913088]]
This is how I solved it, similar to what @Fabio has posted.
new_DB = []
for i in range(len(DB)):
    new_DB.append(tsp_data_unique[tsp_data_unique['order_id'].isin(DB[i])]
                  [['latitude', 'longitude']].values)
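A self-contained sketch of this isin-based subsetting on a tiny made-up frame (column names follow the question; the values are invented):

```python
import pandas as pd

df = pd.DataFrame({
    'order_id': ['519', '520', '523', '527'],
    'latitude': [19.119677, 19.138250, 19.117662, 19.133430],
    'longitude': [72.905081, 72.913190, 72.905484, 72.913268],
})

DB = [['523'], ['527', '519']]

# One (n, 2) array of lat/long pairs per sublist; note that isin returns
# rows in dataframe order, not in the sublist's order
groups = [df[df['order_id'].isin(sub)][['latitude', 'longitude']].values
          for sub in DB]
print(groups[0].shape)  # (1, 2)
print(groups[1].shape)  # (2, 2)
```

If the order within each sublist matters (as it might for a route-distance calculation), the set_index plus df.loc[DB[i]] approach from the accepted answer preserves the sublist order, while isin does not.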
