How to split data matrix into 3 parts - python

I have an x-y data block with x in [390, 1669] and y in [377, 751], and I know the central row, y=597. I want to split the data into 3 parts: a central part (row 597), an upper part (rows 598-751) and a lower part (rows 377-596). I want the function to find the bounds y_max and y_min automatically instead of entering them manually.
def split_data(y_cent_index, data):
    '''Split the data matrix into upper, central and lower parts.'''
    center_row = data[y_cent_index, :]
    y_max_index = ?
    y_min_index = ?
    upper_rows = data[y_cent_index+i, :]
    lower_rows = data[y_cent_index-i, :]
    return center_row, upper_rows, lower_rows

If I understand correctly, you want to find the indices of the largest and the smallest element in your data?
Then you would use NumPy's argmax function (and argmin, respectively):
y_max_index = np.unravel_index(np.argmax(data, axis=None), data.shape)
y_min_index = np.unravel_index(np.argmin(data, axis=None), data.shape)
Each call returns a tuple of indices (row, column). If you only want the y part and your data is indexed as data[y, x], as in your function, take the first value of each tuple.
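As a minimal sketch of what these two calls return, assume a small made-up 3x3 array (the values are illustrative only and do not come from the question):

```python
import numpy as np

# Hypothetical data, indexed as data[row, column]
data = np.array([[3, 7, 1],
                 [9, 2, 8],
                 [4, 0, 6]])

# argmax/argmin with axis=None give an index into the flattened array;
# unravel_index converts it back into a (row, column) tuple.
max_idx = np.unravel_index(np.argmax(data, axis=None), data.shape)
min_idx = np.unravel_index(np.argmin(data, axis=None), data.shape)

max_row, max_col = max_idx
min_row, min_col = min_idx
print(max_idx, data[max_idx])
print(min_idx, data[min_idx])
```

Indexing the array with the returned tuple, as in data[max_idx], gives back the extreme value itself.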

Related

Can somebody explain how this one hot encoder method works?

I found this code online; it one-hot encodes an array of label-encoded values. I particularly don't understand the last line. Please help.
I initially thought that wherever y is 1, it replaces the value at that index with 1, but how?
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def read_dataset():
    df = pd.read_csv("sonar.all-data.csv")
    x = df[df.columns[0:60]].values
    y = df[df.columns[60]]
    encoder = LabelEncoder()
    encoder.fit(y)
    y = encoder.transform(y)  # convert the labels to integers 0..k-1
    y = oneHotEncode(y)
    return (x, y)

def oneHotEncode(labels):
    n_labels = len(labels)
    n_unique_labels = len(np.unique(labels))
    oneHE = np.zeros((n_labels, n_unique_labels))
    oneHE[np.arange(n_labels), labels] = 1
    return oneHE
I would like to understand how this code works, but I don't understand the line with np.arange.
np.arange() is similar to range() but creates a numpy array. Hence, if you have 10 labels, it returns an array with the consecutive numbers from 0 to 9. This array is used to choose the rows of the oneHE array (which contains only zeros after initialization). The labels array is used to choose the columns.
So it's just selecting, in each row, the column given by that row's label and setting the value there to 1.
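A short sketch of that indexing step, assuming four made-up integer labels (not the sonar data): row i is paired with column labels[i], and only those positions are set to 1.

```python
import numpy as np

# Hypothetical label-encoded values
labels = np.array([0, 2, 1, 2])

n_labels = len(labels)
n_unique_labels = len(np.unique(labels))

one_hot = np.zeros((n_labels, n_unique_labels))
# Pairs (0, 0), (1, 2), (2, 1), (3, 2) are selected element by element
one_hot[np.arange(n_labels), labels] = 1
print(one_hot)
```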

Combining two arrays in python term by term

long = np.array(data.Longitude)
lat = np.array(data.Latitude)
coordinates = np.array(385)
for i in range(385):
    coordinates[i] = np.array([lat[i], long[i]])
#x, y = kmeans2(whiten(coordinates), 3, iter = 20)
#plt.scatter(coordinates[:,0], coordinates[:,1], c=y);
#plt.show()
I have a dataset with two columns and I wish to merge the latitude and longitude term by term so that I can apply k-means clustering afterwards. Please help with the array part.
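coordinates = np.array([lat, long]).T
The transpose gives one (lat, long) pair per row, which is the layout your commented-out clustering and plotting lines expect. Or am I missing something here?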
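A small sketch of the pairing, using made-up coordinate values rather than the asker's data. np.column_stack is one way to build an (N, 2) array with one (lat, long) pair per row; it is shown here as an option, not as the only approach.

```python
import numpy as np

# Hypothetical coordinates standing in for data.Latitude / data.Longitude
lat = np.array([10.0, 20.0, 30.0])
long = np.array([100.0, 110.0, 120.0])

# One row per point: column 0 is latitude, column 1 is longitude
coordinates = np.column_stack((lat, long))
print(coordinates.shape)
print(coordinates[1])
```

The result is the same as transposing np.array([lat, long]), so coordinates[:, 0] and coordinates[:, 1] give back the two original columns.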

Easiest way to select the subset of 2-d array by 2 corners

I have a globally distributed variable with shape (len(latitude), len(longitude)), and I want to extract the subset corresponding to an area of interest.
The area is defined by 2 corners (lower-left latitude/longitude and upper-right latitude/longitude). This is what I have now:
VALUE is a 2-d array representing the global distribution
Lon is a 1-d array from -180.0 to 179.875 with 2880 elements
Lat is a 1-d array from -90 to 89.875 with 1440 elements
llcrnrlat, urcrnrlat, llcrnrlon, urcrnrlon = 15, 50, 90, 150
Since llcrnrlat etc. may not be present in Lon or Lat, I can't use
VALUE_SELECT = VALUE[np.where(Lat == llcrnrlat):np.where(Lat == urcrnrlat),
                     np.where(Lon == llcrnrlon):np.where(Lon == urcrnrlon)]
So my attempt is to search Lat and Lon for the index of the nearest value.
def find_nearest(array,value): ## This function was clipped from a website
    idx = (np.abs(array-value)).argmin()
    return array[idx]
llcrnrlon,urcrnrlon = 90,150
llcrnrlat, urcrnrlat = 15, 50
nx_st = np.where(lon == (find_nearest(lon,llcrnrlon )))[0]
nx_en = np.where(lon == (find_nearest(lon,urcrnrlon )))[0]
ny_st = np.where(lat == (find_nearest(lat,llcrnrlat )))[0]
ny_en = np.where(lat == (find_nearest(lat,urcrnrlat )))[0]
lon_select,lat_select = lon[nx_st:nx_en+1], lat[ny_st:ny_en+1]
value_select =VALUE[ny_st:ny_en+1,nx_st:nx_en+1]
After executing this code, I get the following warning:
/Users/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:1: VisibleDeprecationWarning: converting an array with ndim > 0 to an index will result in an error in the future
if __name__ == '__main__':
How can I avoid this warning and the potential future error?
Is there an easier way to get a subset of a 2-d array in my case?
That find_nearest + np.where is a whole lot of computational work that is completely unneeded if the values are evenly distributed. Do you really understand what that code is doing? Read up on each of those functions. You're doing a subtraction on every value in an array, and then finding the index of the minimum of that offset. Then you look up the value at that minimum. Then with the value, you again look through every value in the array and test if it matches, creating a new array of True/False values. Then you sort through that array and find the index of what was True. The same index you already found with argmin()
Each degree of latitude and longitude is split into 8 sections (a step of 0.125). So you just need to add the offset and multiply by 8, right? Then make it an integer, which for these non-negative values automatically applies a floor()
def lat_conv(y):
    return int((y + 90) * 8)
def lon_conv(x):
    return int((x + 180) * 8)
value_select = VALUE[lat_conv(start_lat):lat_conv(stop_lat),
                     lon_conv(start_lon):lon_conv(stop_lon)]
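The conversion can be checked with a small sketch. It assumes the grid described in the question (0.125 degree steps starting at -90 and -180) and fills a stand-in array with lat + lon, since the real VALUE data is not available here. The nearest_index helper is a hypothetical addition for grids that are not evenly spaced: it returns the integer from argmin directly, so nothing array-valued is used as a slice bound.

```python
import numpy as np

# Grid as described in the question: 1440 latitudes, 2880 longitudes
lat = np.arange(-90, 90, 0.125)
lon = np.arange(-180, 180, 0.125)
value = np.add.outer(lat, lon)  # stand-in data, shape (len(lat), len(lon))

def lat_conv(y):
    # Index of latitude y on the evenly spaced grid
    return int((y + 90) * 8)

def lon_conv(x):
    # Index of longitude x on the evenly spaced grid
    return int((x + 180) * 8)

def nearest_index(array, target):
    # Hypothetical helper: plain integer index of the nearest grid value
    return int(np.abs(array - target).argmin())

llcrnrlat, urcrnrlat, llcrnrlon, urcrnrlon = 15, 50, 90, 150
ny_st, ny_en = lat_conv(llcrnrlat), lat_conv(urcrnrlat)
nx_st, nx_en = lon_conv(llcrnrlon), lon_conv(urcrnrlon)

# +1 keeps the upper-right corner inside the selection, as in the question
lat_select = lat[ny_st:ny_en + 1]
lon_select = lon[nx_st:nx_en + 1]
value_select = value[ny_st:ny_en + 1, nx_st:nx_en + 1]
print(value_select.shape)
```

Because every slice bound is a plain Python int, the VisibleDeprecationWarning from the question should not be triggered by this version.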

Getting numpy linspace generated coordinates

I'm using NumPy's linspace to fill in data between points.
lats = (-66.44421,-66.57947,-64.81464,-64.69528)
lons = (-73.03290,-72.73904,-64.71657,-65.03036)
NO3 = (33.48,24.01,17.20,20.03)
xi = np.linspace(min(lats),max(lats),360)
yi = np.linspace(min(lons),max(lons),360)
# grid the data.
zi = griddata((lats, lons), NO3, (xi[None,:], yi[:,None]), method='cubic')
# contour the gridded data.
plt.contourf(xi,yi,zi,15,cmap=cMap)
plt.colorbar()
# plot data points.
plt.scatter(lats,lons,facecolors='none', edgecolors='k',s=26)
plt.show()
I want to retrieve values (missing samples) from the gridded data zi based on coordinate pairs generated from linspace, but the coordinates aren't exact for a dict lookup:
# record index and value of linspace coordinates as key and value
xi_coords = {value: index for index, value in enumerate(xi)}
yi_coords = {value: index for index, value in enumerate(yi)}
# how to retrieve a value in between, at say (-65.11018, -67.08512)?
zi[xi_coords[-65.11018], yi_coords[-67.08512]]
This raises a KeyError.
Is there a smarter workaround for this problem?
If I'm not mistaken, the point you are trying to retrieve is not in your linspace at all, so this is not just a numerical precision problem. If you want to find the closest grid point to any given point, you should define functions rather than using dicts:
latmin = min(lats)
latmax = max(lats)
npoints = 360
def get_lat_index(lat):
    return int(round((npoints-1)*(lat-latmin)/(latmax-latmin)))
and similarly for longitudes.
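A minimal sketch of that index function, using the latitudes from the question. The generalized helper name get_index and its parameters are assumptions made for illustration; the same function can be reused for longitudes by passing their minimum and maximum.

```python
import numpy as np

lats = (-66.44421, -66.57947, -64.81464, -64.69528)
npoints = 360
xi = np.linspace(min(lats), max(lats), npoints)

def get_index(value, vmin, vmax, n):
    # Hypothetical generalization of get_lat_index: position of `value`
    # on an n-point linspace from vmin to vmax, rounded to the nearest point
    return int(round((n - 1) * (value - vmin) / (vmax - vmin)))

i = get_index(-65.11018, min(lats), max(lats), npoints)
# xi[i] is the grid coordinate closest to the requested latitude
print(i, xi[i])
```

The computed index can then be used to look up the gridded array directly, with no dict and no exact float match required.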
One option is rounding. For example to two decimals:
xi_coords = {round(value, 2): index for index, value in enumerate(xi)}
yi_coords = {round(value, 2): index for index, value in enumerate(yi)}
zi[xi_coords[-65.11], yi_coords[-67.08]]

Finding relative maximums of a 2-D numpy array

I have a 2-D numpy array that can be subdivided into 64 boxes (think of a chessboard).
The goal is a function that returns the position and value of the maximum in each box. Something like:
FindRefs(array) --> [(argmaxX00, argmaxY00, Max00), ...,(argmaxX63, argmaxY63, Max63)]
where argmaxXnn and argmaxYnn are the indexes of the whole array (not of the box), and Maxnn is the max value in each box. In other words,
Maxnn = array[argmaxYnn,argmaxXnn]
I've tried the obvious "nested-for" solution:
def FindRefs(a):
    Height, Width = a.shape
    plumx = []
    plumy = []
    lum = []
    w = int(Width/8)
    h = int(Height/8)
    for n in range(0,8): # loop over boxes
        x0 = n*w
        x1 = (n+1)*w
        for m in range(0,8):
            y0 = m*h
            y1 = (m+1)*h
            subflatind = a[y0:y1,x0:x1].argmax() # flattened index within the box
            y, x = np.unravel_index(subflatind, (h, w))
            X = x0 + x
            Y = y0 + y
            lum.append(a[Y,X])
            plumx.append(X)
            plumy.append(Y)
    refs = []
    for pt in range(0,len(plumx)):
        ptx = plumx[pt]
        pty = plumy[pt]
        refs.append((ptx,pty,lum[pt]))
    return refs
It works, but it is neither elegant nor efficient.
So I've tried this more pythonic version:
def FindRefs(a):
    h, w = a.shape[0] // 8, a.shape[1] // 8  # box height and width
    box = [(n*w,m*h) for n in range(0,8) for m in range(0,8)]
    flatinds = [a[b[1]:h+b[1],b[0]:w+b[0]].argmax() for b in box]
    unravels = np.unravel_index(flatinds, (h, w))
    ur = [(unravels[1][n],unravels[0][n]) for n in range(0,len(box))]
    absinds = [list(map(sum,zip(box[n],ur[n]))) for n in range(0,len(box))]
    refs = [(absinds[n][0],absinds[n][1],a[absinds[n][1],absinds[n][0]]) for n in range(0,len(box))]
    return refs
It works fine but, to my surprise, is not more efficient than the previous version!
The question is: Is there a more clever way to do the task?
Note that efficiency matters, as I have many large arrays for processing.
Any clue is welcome. :)
Try this:
from numpy.lib.stride_tricks import as_strided as ast
import numpy as np
def FindRefs3(a):
    box = tuple(x // 8 for x in a.shape)  # integer box size (rows, columns)
    z = ast(a,
            shape=(8,8)+box,
            strides=(a.strides[0]*box[0],a.strides[1]*box[1])+a.strides)
    v3 = np.max(z,axis=-1)
    i3r = np.argmax(z,axis=-1)
    v2 = np.max(v3,axis=-1)
    i2 = np.argmax(v3,axis=-1)
    i2x = np.indices(i2.shape)
    i3 = i3r[np.ix_(*[np.arange(x) for x in i2.shape])+(i2,)]
    i3x = np.indices(i3.shape)
    ix0 = i2x[0]*box[0]+i2
    ix1 = i3x[1]*box[1]+i3
    return list(zip(np.ravel(ix0),np.ravel(ix1),np.ravel(v2)))
Note that your first FindRefs reverses indices, so that for a tuple (i1,i2,v), a[i1,i2] won't return the right value, whereas a[i2,i1] will.
So here's what the code does:
It first calculates the dimensions that each box needs to have (box) given the size of your array. Note that this doesn't do any checking: you need to have an array that can be divided evenly into an 8 by 8 grid.
Then z with ast is the messiest bit. It takes the 2d array, and turns it into a 4d array. The 4d array has dimensions (8,8,box[0],box[1]), so it lets you choose which box you want (the first two axes) and then what position you want in the box (the next two). This lets us deal with all the boxes at once by doing operations on the last two axes.
v3 gives us the maximum values along the last axis: in other words, it contains the maximum of each row in each box. i3r contains the index of the column in that row that holds the max value.
v2 takes the maximum of v3 along its own last axis, which now runs over the rows in the box: it takes the row maxes, and finds the maximum of them, so that v2 is a 2d array containing the maximum value of each box. If all you wanted were the maximums, this is all you'd need.
i2 is the index of the row in the box that holds the maximum value.
Now we need to get the index of the column in the box... that's trickier. i3r contains the column index of the max of each row in the box, but we want the column for the specific row that's specified in i2. We do this by choosing an element from i3r using i2, which gives us i3.
At this point, i2 and i3 are 8 by 8 arrays containing the row and column indexes of the maximums relative to each box. We want the absolute indexes. So we create i2x and i3x (actually, this is pointless; we could just create one, as they are the same), which are just arrays of what the indexes for i2 and i3 are (0, 1, 2, ..., 7 in one dimension, and so on). We then multiply these by the box sizes, and add the relative max indexes, to get the absolute max indexes.
We then combine these to get the same output that you had. Note that if you keep them as arrays, though, instead of making tuples, it's much faster.
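The same "all boxes at once" idea can be sketched without as_strided. The version below swaps in reshape and transpose to build the 4d box view and flattens each box before taking argmax; it is an alternative under the same assumption (the array divides evenly into the grid), not the code from the answer. The function name find_refs_blocks and the random test array are made up for illustration.

```python
import numpy as np

def find_refs_blocks(a, n=8):
    """Return absolute rows, columns and values of the max in each of n*n boxes."""
    bh, bw = a.shape[0] // n, a.shape[1] // n  # box height and width
    # Axes after transpose: (box_row, box_col, row_in_box, col_in_box);
    # the final reshape flattens each box into one axis.
    blocks = (a[:n * bh, :n * bw]
              .reshape(n, bh, n, bw)
              .transpose(0, 2, 1, 3)
              .reshape(n, n, bh * bw))
    flat = blocks.argmax(axis=-1)             # flat index inside each box
    r, c = np.unravel_index(flat, (bh, bw))   # row/column relative to the box
    rows = r + np.arange(n)[:, None] * bh     # absolute row indexes
    cols = c + np.arange(n)[None, :] * bw     # absolute column indexes
    return rows, cols, a[rows, cols]

rng = np.random.default_rng(0)
a = rng.random((16, 24))                      # hypothetical input: 8x8 boxes of 2x3
rows, cols, vals = find_refs_blocks(a)
print(rows.shape, cols.shape, vals.shape)
```

The results stay as 8 by 8 arrays, in line with the remark that keeping arrays instead of building tuples is faster; a[rows[i, j], cols[i, j]] is the maximum of box (i, j).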
