Storing data in a numpy array without overwriting previous data - Python

I am running a for loop to store data in a numpy array. The problem is that after every iteration the previous data is overwritten by the latest one. I want to store all the data, with something like the "extend" behaviour of plain Python lists. I tried append, but it's not storing all the values of all the arrays.
Code:
import numpy as np
import pandas as pd
from skimage import io  # assuming io.imread comes from scikit-image

data1 = np.empty((3, 3), dtype=np.int8)
top_model_weights_path = '/home/ethnicity.071217.23-0.28.hdf5'
df = pd.read_csv('/home/instaurls.csv')
for row in df.itertuples():
    data = io.imread(row[1])
    data1 = np.append(data1, data)
print(data1)
Expected output:
[[[ 34 34 34]
[ 35 35 35]
[ 40 40 40]
...,
[ 8 8 8]
[ 12 12 12]
[ 12 12 12]]
[[ 39 39 39]
[ 30 30 30]
[ 25 25 25]
...,
[ 11 11 11]
[ 1 1 1]
[ 5 5 5]]
[[ 54 54 54]
[ 44 44 44]
[ 34 34 34]
...,
[ 32 32 32]
[ 9 9 9]
[ 0 0 0]]
...,
[[212 212 210]
[167 167 165]
[118 118 116]
...,
[185 186 181]
[176 177 172]
[170 171 166]]
[[220 220 218]
[165 165 163]
[116 116 114]
...,
[158 159 154]
[156 157 152]
[170 171 166]]
[[220 220 218]
[154 154 152]
[106 106 104]
...,
[144 145 140]
[136 137 132]
[158 159 154]]]

top_model_weights_path = '/home/ethnicity.071217.23-0.28.hdf5'
df = pd.read_csv('/home/instaurls.csv')
data1 = np.array([io.imread(row[1]) for row in df.itertuples()])

If your dataset is not too big, there is no problem with building a standard Python list first and then converting it to a numpy array.
If you're not familiar with list comprehensions, the equivalent loop is:

data1 = []
for row in df.itertuples():
    data1.append(io.imread(row[1]))
data1 = np.array(data1)
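For context on why the original loop misbehaves: when np.append is called without an axis argument, it flattens both of its inputs and returns a 1-D copy, so the image shapes are lost on every iteration. A minimal sketch of the difference (the array contents here are made up purely for illustration):

import numpy as np

a = np.zeros((2, 2), dtype=np.int8)
b = np.ones((2, 2), dtype=np.int8)

print(np.append(a, b).shape)   # (8,)  - flattened into 1-D
print(np.stack([a, b]).shape)  # (2, 2, 2) - stacked, which is what the loop wanted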

Related

Find indices of max values in subarrays and apply them to another subarray

I have a file f which holds N (unknown) events. Each event carries a number of reconstructed tracks (unknown and different for each event; call it i, j, etc.). Each track has properties like energy E and likelihood lik. So,
>>> print(f.events.tracks.lik)
[[lik1, lik2, ..., likX], [lik1, lik2, ..., likj], ..., [lik1, lik2, ..., likz]]
prints an array holding N subarrays (one per event), each containing the lik values of all its tracks.
GOAL: call f.events.tracks[:, Inds].E to get the energies for the tracks with max likelihood.
Minimal code example:
>>> import numpy as np
>>> lik = np.random.randint(low=0, high=100, size=50).reshape(5, 10)
>>> print(lik)
[[ 3 49 27 3 80 59 96 99 84 34]
[88 62 61 83 90 9 62 30 92 80]
[ 5 21 69 40 2 40 13 63 42 46]
[ 0 55 71 67 63 49 29 7 21 7]
[40 7 68 46 95 34 74 88 79 15]]
>>> energy = np.random.randint(low=100, high=2000, size=50).reshape(5, 10)
>>> print(energy)
[[1324 1812 917 553 185 743 358 877 1041 905]
[1407 663 359 383 339 1403 1511 1964 1797 1096]
[ 315 1431 565 786 544 1370 919 1617 1442 925]
[1710 698 246 1631 1374 1844 595 465 908 953]
[ 305 384 668 952 458 793 303 153 661 791]]
>>> Inds = np.argmax(lik, axis=1)
>>> print(Inds)
[2 1 8 6 7]
PROBLEM:
>>> # call energy[Inds] to get
# [917, 663, 1442, 1844, 153]
What is the correct way of accessing these energies?
You can select the values indexed by Inds for each line using 2D indexing, with a temporary array containing [0, 1, 2, ...] (generated using np.arange).
Here is an example:
energy[np.arange(len(Inds)), Inds]
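For completeness, a self-contained sketch of the whole thing (using the same random setup as above, so the concrete numbers will differ from run to run); np.take_along_axis, available in NumPy >= 1.15, is an equivalent alternative:

import numpy as np

lik = np.random.randint(low=0, high=100, size=50).reshape(5, 10)
energy = np.random.randint(low=100, high=2000, size=50).reshape(5, 10)

Inds = np.argmax(lik, axis=1)  # column of the max likelihood per row

# Pair the row numbers [0, 1, ..., 4] with the per-row column indices.
best = energy[np.arange(len(Inds)), Inds]

# Equivalent, without building the row-index array explicitly:
best_alt = np.take_along_axis(energy, Inds[:, None], axis=1)[:, 0]
assert (best == best_alt).all()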

Splitting a Series whose single column contains lists into multiple columns with single values

Given a Series object which I have pulled from a dataframe, for example through:
columns = list(df)
for col in columns:
    s = df[col]  # the Series object
The Series contains a <class 'list'> in each row, making it look like this:
0 [116, 66]
2 [116, 66]
4 [116, 66]
6 [116, 66]
8 [116, 66]
...
1498 [117, 66]
1500 [117, 66]
1502 [117, 66]
1504 [117, 66]
1506 [117, 66]
How could I split this up, so it becomes two columns in the Series instead?
0 116 66
2 116 66
...
1506 117 66
And then append it back to the original df?
Starting from Ch3steR's comment suggesting pd.DataFrame(s.tolist()), I managed to get the answer I was looking for, including renaming the columns in the new dataframe to include the column name of the existing Series.
columns = list(df)
for col in columns:
    df2 = pd.DataFrame(df[col].tolist())
    df2.columns = [col + "_" + str(y) for y in range(len(df2.columns))]
    print(df2)
To keep this shorter, as also suggested by Ch3steR, we can simplify the above (note the underscore needs to be part of the prefix to reproduce the column names below):

columns = list(df)
for col in columns:
    df2 = pd.DataFrame(df[col].tolist()).add_prefix(col + "_")
    print(df2)
Which in my case, gives the following output:
FrameLen_0 FrameLen_1
0 116 66
1 116 66
2 116 66
3 116 66
4 116 66
.. ... ...
749 117 66
750 117 66
751 117 66
752 117 66
753 117 66
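To also cover the last part of the question (appending the new columns back to the original df), a minimal sketch, assuming every column holds lists of equal length; passing index=df.index keeps the expanded rows aligned with the original, non-contiguous index:

columns = list(df)
expanded = [pd.DataFrame(df[col].tolist(), index=df.index).add_prefix(col + "_")
            for col in columns]
# Drop the original list columns and attach the expanded ones.
df = pd.concat([df.drop(columns=columns)] + expanded, axis=1)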

Getting a list of lists of nearest neighbours within boundaries

I'm trying to return a list of lists of the vertical, horizontal and diagonal nearest neighbours of every item of a 2D numpy array:
import numpy as np
import copy

tilemap = np.arange(99).reshape(11, 9)
print(tilemap)

def get_neighbor(pos, array):
    x = copy.deepcopy(pos[0])
    y = copy.deepcopy(pos[1])
    grid = copy.deepcopy(array)
    split = []
    split.append([grid[y-1][x-1]])
    split.append([grid[y-1][x]])
    split.append([grid[y-1][x+1]])
    split.append([grid[y][x-1]])
    split.append([grid[y][x+1]])
    split.append([grid[y+1][x-1]])
    split.append([grid[y+1][x]])
    split.append([grid[y+1][x+1]])
    print("\n Neighbors of ITEM[{}]\n {}".format(grid[y][x], split))

cordinates = [5, 6]
get_neighbor(pos=cordinates, array=tilemap)
I would want a list like this:
first item = 0
[[1], [12], [13],
 [1, 2], [12, 24], [13, 26],
 [1, 2, 3], [12, 24, 36], [13, 26, 39], ...
until it reaches the boundaries completely, then it proceeds to the second item = 1 and keeps adding to the list. If there is a neighbour above, it should be added too.
MY RESULT
[[ 0 1 2 3 4 5 6 7 8]
[ 9 10 11 12 13 14 15 16 17]
[18 19 20 21 22 23 24 25 26]
[27 28 29 30 31 32 33 34 35]
[36 37 38 39 40 41 42 43 44]
[45 46 47 48 49 50 51 52 53]
[54 55 56 57 58 59 60 61 62]
[63 64 65 66 67 68 69 70 71]
[72 73 74 75 76 77 78 79 80]
[81 82 83 84 85 86 87 88 89]
[90 91 92 93 94 95 96 97 98]]
Neighbors of ITEM[59]
[[49], [50], [51], [58], [60], [67], [68], [69]]
Alright, what about using a function like this? It takes the array, your target index, and the "radius" of the elements to be included.
def get_idx_adj(arr, idx, radius):
    num_rows, num_cols = arr.shape
    idx_row, idx_col = idx
    slice_1 = np.s_[max(0, idx_row - radius):min(num_rows, idx_row + radius + 1)]
    slice_2 = np.s_[max(0, idx_col - radius):min(num_cols, idx_col + radius + 1)]
    return arr[slice_1, slice_2]
I'm currently trying to find the best way to transform the index of the element, so that the function can be used on its own output successively to get all the subarrays of various sizes.
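A hedged usage sketch: call get_idx_adj with a growing radius until the window stops growing, i.e. it has reached every boundary. Note the function takes the index as (row, col), so the asker's [x=5, y=6] becomes (6, 5):

windows = []
for radius in range(1, max(tilemap.shape)):
    window = get_idx_adj(tilemap, (6, 5), radius)
    windows.append(window)
    if window.shape == tilemap.shape:  # the window now covers the whole array
        break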

train_test_split converts string labels to np.array. Is there any way to get back the original label name?

I have an image dataset with string labels. When I split the data using train_test_split from the sklearn library, it seems to convert the labels to np.array type. Is there a way to get back the original string labels?
The code below splits the data into train and test sets:

imgs, y = load_images()
train_img, ytrain_img, test_img, ytest_img = train_test_split(imgs, y, test_size=0.2, random_state=1)
If I print y, it gives me the label name, but if I print the split label value it gives an array:
for k in y:
    print(k)
    break

for k in ytrain_img:
    print(k)
    break
Output:
001.Affenpinscher
[[[ 97 180 165]
[ 93 174 159]
[ 91 169 152]
...
[[ 88 171 156]
[ 88 170 152]
[ 84 162 145]
...
[130 209 222]
[142 220 233]
[152 230 243]]
[[ 99 181 163]
[ 98 178 161]
[ 92 167 151]
...
[130 212 224]
[137 216 229]
[143 222 235]]
...
[[ 85 147 158]
[ 85 147 158]
[111 173 184]
...
[227 237 244]
[236 248 250]
[234 248 247]]
[[ 94 154 166]
[ 96 156 168]
[133 194 204]
...
[226 238 244]
[237 249 253]
[237 252 254]]
...
[228 240 246]
[238 252 255]
[241 255 255]]]
Is there a way to convert back the array to the original label name?
No, you are interpreting the output of train_test_split incorrectly.
train_test_split works this way:

A_train, A_test, B_train, B_test, C_train, C_test, ... = train_test_split(A, B, C, ..., test_size=0.2)

You can give it as many arrays to split as you like. For each given array, it returns the train split and then the test split, then does the same for the next array, then the third array, and so on.
So in your case it is actually:

train_img, test_img, ytrain_img, ytest_img = train_test_split(imgs, y,
                                                              test_size=0.2,
                                                              random_state=1)

You mixed up the names of the outputs and were therefore using them incorrectly.
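A minimal toy sketch of the corrected unpacking (the data here is made up purely for illustration):

import numpy as np
from sklearn.model_selection import train_test_split

imgs = np.arange(10).reshape(10, 1)                    # stand-in for image data
y = np.array(["label_%d" % i for i in range(10)])      # string labels

train_img, test_img, ytrain_img, ytest_img = train_test_split(
    imgs, y, test_size=0.2, random_state=1)

print(ytrain_img[0])  # a string label such as 'label_2', not an image array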

How to subset a pandas dataframe with a list

I have a pandas dataframe like this:
order_id latitude longitude
0 519 19.119677 72.905081
1 520 19.138250 72.913190
2 521 19.138245 72.913183
3 523 19.117662 72.905484
4 524 19.137793 72.913088
5 525 19.119372 72.893768
6 526 19.116275 72.892951
7 527 19.133430 72.913268
8 528 19.136800 72.917185
9 529 19.118284 72.901114
10 530 19.127193 72.914269
11 531 19.114269 72.904039
12 532 19.136292 72.913941
13 533 19.119075 72.895115
14 534 19.119677 72.905081
15 535 19.119677 72.905081
And one list
DB
Out[658]:
[['523'],
['526', '533'],
['527', '528', '532', '535'],
['530', '519'],
['529', '531', '525', '534'],
['520', '521', '524']]
Now I want to subset the dataframe on the list elements. There are 6 elements in the list, and every element is a sublist of order_ids. So, for every sub-element I want the corresponding latitude and longitude, and then I want to calculate the haversine distance between each order_id location:
DB[2]
['527', '528', '532', '535']
Then I want to subset the main dataframe for the latitude and longitude pairs, so it should return me an array like this:
array([[ 19.11824057, 72.8939447 ],
[ 19.1355074 , 72.9147978 ],
[ 19.11917348, 72.90518167],
[ 19.127193 , 72.914269 ]])
(Just an example, not the correct lat/long pairs.)
I am doing the following:

db_lat = []
db_long = []
for i in range(len(DB)):
    l = len(DB[i])
    for j in range(l):
        db_lat.append(tsp_data_unique.latitude[
            tsp_data_unique['order_id'] == ''.join(DB[i][j])])
        db_long.append(tsp_data_unique.longitude[
            tsp_data_unique['order_id'] == ''.join(DB[i][j])])

But it gives me one flat list of all the lats and longs present in DB, and I am not able to distinguish which lat and long belong to which DB element. So, for every DB element (6 in my case) I want a separate array of lats and longs. Please help.
First of all, I would convert your int column to str so that the dataframe can be compared with the values of the list:
df['order_id'] = df['order_id'].apply(str)
and then set the index to order_id:
df = df.set_index('order_id')
Then you can do something like:
pairs = df.loc[DB[2]].values
obtaining:
array([[ 19.13343 , 72.913268],
[ 19.1368 , 72.917185],
[ 19.136292, 72.913941],
[ 19.119677, 72.905081]])
EDIT:
Iterating over your list, you can then do:

In [93]: for i in range(len(DB)):
   ....:     p = df.loc[DB[i]].values
   ....:     print p
   ....:
[[ 19.117662 72.905484]]
[[ 19.116275 72.892951]
[ 19.119075 72.895115]]
[[ 19.13343 72.913268]
[ 19.1368 72.917185]
[ 19.136292 72.913941]
[ 19.119677 72.905081]]
[[ 19.127193 72.914269]
[ 19.119677 72.905081]]
[[ 19.118284 72.901114]
[ 19.114269 72.904039]
[ 19.119372 72.893768]
[ 19.119677 72.905081]]
[[ 19.13825 72.91319 ]
[ 19.138245 72.913183]
[ 19.137793 72.913088]]
This is how I solved it; it is similar to what @Fabio posted:
new_DB = []
for i in range(len(DB)):
    new_DB.append(tsp_data_unique[tsp_data_unique['order_id'].isin(DB[i])]
                  [['latitude', 'longitude']].values)
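As a follow-up for the haversine step mentioned in the question, a minimal sketch, assuming the standard haversine formula and the new_DB groups built above (the helper name haversine_km is ours):

import itertools
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2, r=6371.0):
    # Great-circle distance between two (lat, lon) points, in kilometres.
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * r * np.arcsin(np.sqrt(a))

# Pairwise distances within one group of orders, e.g. new_DB[2]:
for (la1, lo1), (la2, lo2) in itertools.combinations(new_DB[2], 2):
    print(haversine_km(la1, lo1, la2, lo2))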
