Combining two arrays in Python term by term

long = np.array(data.Longitude)
lat = np.array(data.Latitude)
coordinates = np.array(385)
for i in range(385):
    coordinates[i] = np.array([lat[i], long[i]])
#x, y = kmeans2(whiten(coordinates), 3, iter = 20)
#plt.scatter(coordinates[:,0], coordinates[:,1], c=y);
#plt.show()
I have a dataset with two columns and I wish to merge the latitude and longitude term by term so that I can apply k-means clustering afterwards. Please help with the array part.

coordinates = np.array([lat, long])
or am I missing something here?
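Note that np.array([lat, long]) stacks the two columns as rows, giving shape (2, 385). If kmeans2 should see one row per coordinate pair, i.e. shape (385, 2), a minimal sketch (assuming data has the Latitude and Longitude columns from the question) would be:
import numpy as np
from scipy.cluster.vq import kmeans2, whiten

lat = np.asarray(data.Latitude)
long = np.asarray(data.Longitude)

# one row per point: row i is [lat[i], long[i]]
coordinates = np.column_stack((lat, long))   # shape (n_points, 2)

x, y = kmeans2(whiten(coordinates), 3, iter=20)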

Related

Order of input data affects result of 3D polynomial fit

I am working on a 3D polynomial fit. The data I have are the x and y coordinate of 5 stations, and the velocities at these 5 stations. What I want to do is fit a grid through these points. In turn I will use this grid to determine the velocity gradients using the predicted velocities at each grid point.
My code is:
import numpy as np
from sklearn import linear_model
from sklearn.preprocessing import PolynomialFeatures

xx_vel = np.array([[4.78,52.32], [5.18,52.10], [4.45,51.97], [4.92,51.97], [5.15,51.85]]) #location of stations in degrees longitude and latitude
X = xx_vel #coordinates
Z = np.array([-0.00, -0.766, -0.00, -1.732, -1.00]) #velocities at 5 stations
deg_of_poly = 3
poly = PolynomialFeatures(degree=deg_of_poly)
X_ = poly.fit_transform(X)
clf = linear_model.LinearRegression()
clf.fit(X_, Z)
x_pred = np.linspace(4, 6, 27) #defining grid points
y_pred = np.linspace(51.5, 52.7, 27) #defining grid points
predict_x, predict_y = np.meshgrid(x_pred, y_pred)
predict_xy = np.concatenate((predict_x.reshape(-1, 1), predict_y.reshape(-1, 1)), axis=1)
predict_x_ = poly.fit_transform(predict_xy)
predict_z = clf.predict(predict_x_)
predict_z_poly = predict_z.reshape(predict_x.shape)
Using this code I obtain the following fit:
This all seemed fine, until I changed the order of the input data. So if I for example switch the first and last stations so that my input arrays are:
xx_vel = np.array([[5.15,51.85],[5.18,52.10], [4.45,51.97], [4.92,51.97], [4.78,52.32]])
Z = np.array([-1.00, -0.766, -0.00, -1.732, -0.00])
I obtain a different fit. Is there something I am doing wrong? Or is there a way I can make sure I obtain the same results no matter in what order the data is given? I would think that this should not have an effect on the result.
Thanks in advance!
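A quick diagnostic, offered only as a sketch reusing the objects defined above rather than as an answer to why the fits differ: refit on a randomly permuted copy of the same stations and compare the predicted grids.
perm = np.random.permutation(len(Z))
clf_perm = linear_model.LinearRegression()
clf_perm.fit(poly.fit_transform(X[perm]), Z[perm])
# False would confirm that the result depends on the row order
print(np.allclose(predict_z, clf_perm.predict(predict_x_)))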

Easiest way to select the subset of 2-d array by 2 corners

With a globally distributed variable of shape (len(latitude), len(longitude)), I want to get a subset corresponding to an area of interest.
The specific area is defined by 2 corners (lower-left latitude/longitude and upper-right latitude/longitude). So, this is what I have now:
VALUE is a 2-d array representing the global distribution
Lon is a 1-d array from -180.0 to 179.875 with 2880 elements
Lat is a 1-d array from -90 to 89.875 with 1440 elements.
llcrnrlat, urcrnrlat, llcrnrlon, urcrnrlon = 15, 50, 90, 150
Noticing that llcrnrlat etc. may not be contained exactly in Lat or Lon, I can't use
VALUE_SELECT = VALUE[np.where(Lat == llcrnrlat):np.where(Lat == urcrnrlat),
                     np.where(Lon == llcrnrlon):np.where(Lon == urcrnrlon)]
So, my attempt is to search Lat and Lon for the index of the nearest value.
def find_nearest(array, value):  ## This function was clipped from a website
    idx = (np.abs(array - value)).argmin()
    return array[idx]
llcrnrlon,urcrnrlon = 90,150
llcrnrlat, urcrnrlat = 15, 50
nx_st = np.where(lon == (find_nearest(lon,llcrnrlon )))[0]
nx_en = np.where(lon == (find_nearest(lon,urcrnrlon )))[0]
ny_st = np.where(lat == (find_nearest(lat,llcrnrlat )))[0]
ny_en = np.where(lat == (find_nearest(lat,urcrnrlat )))[0]
lon_select,lat_select = lon[nx_st:nx_en+1], lat[ny_st:ny_en+1]
value_select =VALUE[ny_st:ny_en+1,nx_st:nx_en+1]
After executing this subroutine, I get the following warning:
/Users/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:1: VisibleDeprecationWarning: converting an array with ndim > 0 to an index will result in an error in the future
  if __name__ == '__main__':
How can I avoid this warning or potential error?
Is there an easier way to get a subset of a 2-d array in my case?
That find_nearest + np.where is a whole lot of computational work that is completely unneeded if the values are evenly spaced. Do you really understand what that code is doing? Read up on each of those functions. You're doing a subtraction on every value in an array, then finding the index of the minimum of that offset. Then you look up the value at that minimum. Then, with that value, you again look through every value in the array and test whether it matches, creating a new array of True/False values. Then you search that array for the index of what was True, the same index you already found with argmin().
Each degree of lat/lon is split into 8 sections. So you just need to offset to a zero origin and multiply by 8, right? Then make it an integer, which (for these non-negative values) applies a floor():
def lat_conv(y):
    return int((y + 90) * 8)

def lon_conv(x):
    return int((x + 180) * 8)

value_select = VALUE[lat_conv(start_lat):lat_conv(stop_lat),
                     lon_conv(start_lon):lon_conv(stop_lon)]
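For the corners given in the question, a quick check of this index arithmetic (a sketch, assuming VALUE and the 0.125-degree Lat/Lon grids described above) would be:
llcrnrlat, urcrnrlat, llcrnrlon, urcrnrlon = 15, 50, 90, 150

ny_st, ny_en = lat_conv(llcrnrlat), lat_conv(urcrnrlat)   # (15+90)*8 = 840,  (50+90)*8  = 1120
nx_st, nx_en = lon_conv(llcrnrlon), lon_conv(urcrnrlon)   # (90+180)*8 = 2160, (150+180)*8 = 2640

value_select = VALUE[ny_st:ny_en + 1, nx_st:nx_en + 1]    # +1 keeps the upper-corner row/column
lat_select, lon_select = Lat[ny_st:ny_en + 1], Lon[nx_st:nx_en + 1]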

Measure distance between 5-dimensional data sets

I want to measure the distance (Euclidean) between data sets of 5 dimensions.
It looks like this:
center x
0 [0.09771348879, 1.856078237, 2.100760575, 9.25... [-1.35602640228e-12, -2.94706481441e-11, -6.51...
1 [8.006780488, 1.097849488, 0.6275244427, 0.572... [4.99212418613, 5.01853294023, -0.014304672946...
2 [-1.40785823, -1.714959744, -0.5524032233, -0.... [-1.61000102139e-11, -4.680034138e-12, 1.96087...
The first column is the index, then one point (center), and the third is the other point (x); all of the points are 5-D.
I want to use pdist since it works for n-dimensional data. The problem is that pdist expects the points arranged as m n-dimensional row vectors in a matrix X, while what I have above is only printed output, not that matrix, and it contains the index as well, which it should not.
My code is (S is the DataFrame shown above):
import pandas as pd
from scipy.spatial.distance import pdist

S = pd.DataFrame(paired_data, columns=['x','center'])
print (S.to_string())
Y = pdist(S[1:], 'euclidean')
print Y
This seems to work:
for i in range(S.shape[0]):
    M = np.matrix([S['x'][i], S['center'][i]])
    print pdist(M, 'euclidean')
or with iterrows():
for row in S.iterrows():
    M = np.matrix([row[1]['x'], row[1]['center']])
    print pdist(M, 'euclidean')
Note that the creation of a matrix isn't necessary; pdist will handle a Python list of lists just fine:
for row in S.iterrows():
    print pdist([row[1]['x'], row[1]['center']], 'euclidean')
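A vectorized alternative, offered as a sketch assuming each cell of S really holds a length-5 list as in the printout, is to stack the two columns into (m, 5) arrays and take the row-wise norm of their difference:
import numpy as np

centers = np.vstack(S['center'].values)            # shape (m, 5)
xs = np.vstack(S['x'].values)                      # shape (m, 5)
distances = np.linalg.norm(centers - xs, axis=1)   # one Euclidean distance per row
print distances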

Getting CDF of variable-sized numpy arrays in Python using same bins?

I'd like to make a set of comparable empirical CDFs for a few numpy arrays (each of different length) and store these in a pandas dataframe:
import scipy
import numpy as np
import pandas
from statsmodels.distributions.empirical_distribution import ECDF

a = scipy.randn(100)
b = scipy.randn(500)
# ECDF from statsmodels
cdf_a = ECDF(a)
cdf_b = ECDF(b)
The problem is that cdf_a.x, cdf_a.y will have different lengths from cdf_b.x, cdf_b.y, and I would like these to be the same length, i.e. use the same number of bins to compute the CDFs so that they can be plotted on the same scale from a pandas DataFrame. This is not possible:
df = pandas.DataFrame({"cdf_a": cdf_a.y, "cdf_b": cdf_b.y})
since the CDFs are not of the same length. How can I bin a and b using the same bins when computing their CDFs, so that I get comparable same-length vectors back?
Is this the best solution?
bins = np.linspace(0, 1, 10)
v1 = cdf_a(bins)
v2 = cdf_b(bins)
The way we use it in some goodness-of-fit tests is to stack the arrays, so both empirical CDFs are evaluated on all points, i.e. the points from both arrays.
Then use np.searchsorted to get the ranking: the number of points in dataset 1 below x and the number of points in dataset 2 below x.
If I remember correctly, look at scipy.stats.ks_2samp:
data1 = np.sort(data1)
data2 = np.sort(data2)
n1, n2 = len(data1), len(data2)
data_all = np.concatenate([data1, data2])
cdf1 = np.searchsorted(data1, data_all, side='right') / (1.0*n1)
cdf2 = np.searchsorted(data2, data_all, side='right') / (1.0*n2)
It appears that this is a good solution:
bins = np.linspace(0, 1, 10)
v1 = cdf_a(bins)
v2 = cdf_b(bins)
Then len(v1) == len(v2) and these can be plotted as CDFs of a, b on the same scale.
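To get back to the DataFrame goal, a small sketch reusing a, b, cdf_a, cdf_b from above; note that a grid spanning both samples covers the full range of the randn data, unlike np.linspace(0, 1, 10):
grid = np.linspace(min(a.min(), b.min()), max(a.max(), b.max()), 100)
df = pandas.DataFrame({"cdf_a": cdf_a(grid), "cdf_b": cdf_b(grid)}, index=grid)
df.plot()   # both empirical CDFs, evaluated at the same x values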

Remove data points below a curve with python

I need to compare some theoretical data with real data in python.
The theoretical data comes from solving an equation.
To improve the comparison I would like to remove data points that fall far from the theoretical curve. I mean, I want to remove the points below and above the red dashed lines in the figure (made with matplotlib).
Both the theoretical curves and the data points are arrays of different lengths.
I can try to remove the points roughly by eye; for example, the first upper point can be detected using:
data2[(data2.redshift<0.4)&(data2.dmodulus>1)]
rec.array([('1997o', 0.374, 1.0203223485103787, 0.44354759972859786)], dtype=[('SN_name', '|S10'), ('redshift', '<f8'), ('dmodulus', '<f8'), ('dmodulus_error', '<f8')])
But I would like a less by-eye way.
So, can anyone help me finding an easy way of removing the problematic points?
Thank you!
This might be overkill; it is based on your comment:
Both the theoretical curves and the data points are arrays of
different length.
I would do the following:
1. Truncate the data set so that its x values lie within the max and min values of the theoretical set.
2. Interpolate the theoretical curve using scipy.interpolate.interp1d and the above truncated data x values. The reason for step (1) is to satisfy the constraints of interp1d.
3. Use numpy.where to find data y values that are outside the range of acceptable theory values.
4. Don't discard these values, as was suggested in comments and other answers. If you want, for clarity, point them out by plotting the 'inliers' in one color and the 'outliers' in another color.
Here's a script that is close to what you are looking for, I think. Hopefully it will help you accomplish what you want:
import numpy as np
import scipy.interpolate as interpolate
import matplotlib.pyplot as plt

# make up data
def makeUpData():
    '''Make many more data points (x,y,yerr) than theory (x,y),
    with theory yerr corresponding to a constant "sigma" in y,
    about the x,y value'''
    NX = 150
    dataX = (np.random.rand(NX)*1.1)**2
    dataY = (1.5*dataX+np.random.rand(NX)**2)*dataX
    dataErr = np.random.rand(NX)*dataX*1.3
    theoryX = np.arange(0,1,0.1)
    theoryY = theoryX*theoryX*1.5
    theoryErr = 0.5
    return dataX,dataY,dataErr,theoryX,theoryY,theoryErr
def makeSameXrange(theoryX,dataX,dataY):
    '''
    Truncate the dataX and dataY ranges so that dataX min and max are within
    the max and min of theoryX.
    '''
    minT,maxT = theoryX.min(),theoryX.max()
    goodIdxMax = np.where(dataX<maxT)
    goodIdxMin = np.where(dataX[goodIdxMax]>minT)
    return (dataX[goodIdxMax])[goodIdxMin],(dataY[goodIdxMax])[goodIdxMin]
# take 'theory' and get values at every 'data' x point
def theoryYatDataX(theoryX,theoryY,dataX):
    '''For every dataX point, find the interpolated theoryY value. theoryX is
    needed for interpolation.'''
    f = interpolate.interp1d(theoryX,theoryY)
    return f(dataX[np.where(dataX<np.max(theoryX))])
# collect valid points
def findInlierSet(dataX,dataY,interpTheoryY,theoryErr):
    '''Find where theoryY-theoryErr < dataY < theoryY+theoryErr and return
    the valid points.'''
    withinUpper = np.where(dataY<(interpTheoryY+theoryErr))
    withinLower = np.where(dataY[withinUpper]
                           >(interpTheoryY[withinUpper]-theoryErr))
    return (dataX[withinUpper])[withinLower],(dataY[withinUpper])[withinLower]

def findOutlierSet(dataX,dataY,interpTheoryY,theoryErr):
    '''Find where dataY falls outside theoryY-theoryErr < dataY < theoryY+theoryErr
    and return the offending points.'''
    withinUpper = np.where(dataY>(interpTheoryY+theoryErr))
    withinLower = np.where(dataY<(interpTheoryY-theoryErr))
    return (dataX[withinUpper],dataY[withinUpper],
            dataX[withinLower],dataY[withinLower])
if __name__ == "__main__":
    dataX,dataY,dataErr,theoryX,theoryY,theoryErr = makeUpData()
    TruncDataX,TruncDataY = makeSameXrange(theoryX,dataX,dataY)
    interpTheoryY = theoryYatDataX(theoryX,theoryY,TruncDataX)
    inDataX,inDataY = findInlierSet(TruncDataX,TruncDataY,interpTheoryY,
                                    theoryErr)
    outUpX,outUpY,outDownX,outDownY = findOutlierSet(TruncDataX,
                                                     TruncDataY,
                                                     interpTheoryY,
                                                     theoryErr)
    #print inlierIndex
    fig = plt.figure()
    ax = fig.add_subplot(211)
    ax.errorbar(dataX,dataY,dataErr,fmt='.',color='k')
    ax.plot(theoryX,theoryY,'r-')
    ax.plot(theoryX,theoryY+theoryErr,'r--')
    ax.plot(theoryX,theoryY-theoryErr,'r--')
    ax.set_xlim(0,1.4)
    ax.set_ylim(-.5,3)
    ax = fig.add_subplot(212)
    ax.plot(inDataX,inDataY,'ko')
    ax.plot(outUpX,outUpY,'bo')
    ax.plot(outDownX,outDownY,'ro')
    ax.plot(theoryX,theoryY,'r-')
    ax.plot(theoryX,theoryY+theoryErr,'r--')
    ax.plot(theoryX,theoryY-theoryErr,'r--')
    ax.set_xlim(0,1.4)
    ax.set_ylim(-.5,3)
    fig.savefig('findInliers.png')
This figure is the result:
In the end I used some of Yann's code:
def theoryYatDataX(theoryX,theoryY,dataX):
    '''For every dataX point, find the interpolated theoryY value. theoryX is
    needed for interpolation.'''
    f = interpolate.interp1d(theoryX,theoryY)
    return f(dataX[np.where(dataX<np.max(theoryX))])

def findOutlierSet(data,interpTheoryY,theoryErr):
    '''Find where data.dmodulus falls outside theoryY-theoryErr < dataY < theoryY+theoryErr
    and return the inlier and outlier sets.'''
    up = np.where(data.dmodulus > (interpTheoryY+theoryErr))
    low = np.where(data.dmodulus < (interpTheoryY-theoryErr))
    # join all the indices together in a flat array
    out = np.hstack([up,low]).ravel()
    index = np.array(np.ones(len(data),dtype=bool))
    index[out] = False
    datain = data[index]
    dataout = data[out]
    return datain, dataout
def selectdata(data,theoryX,theoryY):
    """
    Data selection: z<1 and +-0.5 LFLRW separation
    """
    # Select data with redshift z<1
    data1 = data[data.redshift < 1]
    # From modulus to light distance:
    data1.dmodulus, data1.dmodulus_error = modulus2distance(data1.dmodulus,data1.dmodulus_error)
    # order data by redshift
    data1.sort(order='redshift')
    # Outliers: distance to LFLRW curve bigger than +-0.5
    theoryErr = 0.5
    # Theory curve interpolation to get the same points as data
    interpy = theoryYatDataX(theoryX,theoryY,data1.redshift)
    datain, dataout = findOutlierSet(data1,interpy,theoryErr)
    return datain, dataout
Using those functions I can finally obtain:
Thank you all for your help.
Just look at the difference between the red curve and the points; if it is bigger than the difference between the red curve and the dashed red curve, remove the point:
diff = np.abs(points - red_curve)
index = diff < (dashed_curve - red_curve)   # keep only the points inside the band
filtered = points[index]
But please take the comment from NickLH seriously. Your data looks pretty good without any filtering; your "outliers" all have a very big error and won't affect the fit much.
You could either use numpy.where() to identify which xy pairs meet your plotting criteria, or perhaps use enumerate to do pretty much the same thing. Example:
x_list = [ 1, 2, 3, 4, 5, 6 ]
y_list = ['f','o','o','b','a','r']
result = [y_list[i] for i, x in enumerate(x_list) if 2 <= x < 5]
print result
I'm sure you could change the conditions so that the '2' and '5' in the above example are replaced by functions of your curves.
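For instance, a minimal sketch of the numpy.where() variant; here x_data, y_data, lower_curve and upper_curve are hypothetical stand-ins for the data arrays and for interpolants of the two dashed curves:
import numpy as np

x = np.asarray(x_data)
y = np.asarray(y_data)
low = lower_curve(x)    # lower dashed curve evaluated at the data x values
high = upper_curve(x)   # upper dashed curve evaluated at the data x values

keep = np.where((y > low) & (y < high))
x_in, y_in = x[keep], y[keep]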
