I have a dataframe df with four columns ['ID', 'Lat', 'Lon', 'Elevation'] and n rows. Each row represents one point. I want to create four new columns ['Aggregated_ID', 'Lat_mean', 'Lon_mean', 'Ele_mean'] and add them to df, so that points closer together than a certain threshold z share the same 'Aggregated_ID' and carry the group averages of Lat, Lon and Elevation in the other columns. I also have the n x n distance matrix between the points. I tried this:
import numpy as np
from scipy.cluster.hierarchy import fclusterdata

def create_ID_column(df, distances, z, start_id=0):
    clusters = fclusterdata(distances, z, criterion='distance')
    df['AggregatedID'] = np.char.add("Aggregate_", (clusters + start_id).astype(str))
    return df, max(clusters) + start_id

def create_mean_coordinate_columns(df):
    mean_coordinates = df.groupby('AggregatedID').mean().reset_index()
    mean_coordinates = mean_coordinates[['AggregatedID', 'Longitude', 'Latitude', 'Elevation']]
    mean_coordinates = mean_coordinates.rename(columns={'Longitude': 'Lon_mean', 'Latitude': 'Lat_mean', 'Elevation': 'Ele_mean'})
    df = df.merge(mean_coordinates, on='AggregatedID')
    return df

z = 1000
# dist is the n x n matrix of distances between the points
df, start_id = create_ID_column(df, dist, z)
df = create_mean_coordinate_columns(df)
It runs quickly with a few points, but now I need to do this on 60,000 points. Although I have 32 GB of RAM, this code uses all of it and has been running since yesterday. Is there a way to make it faster? I included both steps, but the bottleneck is only the call to create_ID_column, which does the clustering. Thanks!
EDIT: I think the only option may be to use a different algorithm instead of fclusterdata, but I don't know which one. Now I'm trying this:
from sklearn.cluster import AgglomerativeClustering

def create_ID_column(df, distances, z, start_id=0):
    clustering = AgglomerativeClustering(n_clusters=None, distance_threshold=z, linkage='complete')
    clustering.fit(distances)
    clusters = clustering.labels_
    df['AggregatedID'] = np.char.add("Aggregate_", (clusters + start_id).astype(str))
    return df, max(clusters) + start_id

df, start_id = create_ID_column(df, dist, z)
I ran it and hope it is faster, but I'm not optimistic because it seems to behave the same way. I don't understand why it is so hard to get a result: 60,000 points is a lot, but not that many.
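One thing worth checking before switching algorithms: fclusterdata treats its first argument as an n x m matrix of observations, not as a distance matrix, so passing the n x n distances makes it cluster 60,000 points in 60,000 dimensions. A full float64 distance matrix of that size is also about 29 GB on its own. Below is a minimal sketch, assuming a precomputed symmetric distance matrix and complete linkage; the column names follow the first paragraph of the question, and the sample values and threshold are made up.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform

# Made-up sample: two tight groups of points (illustrative values only).
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Lat": [0.0, 0.0, 10.0, 10.0],
    "Lon": [0.0, 1.0, 0.0, 1.0],
    "Elevation": [100.0, 110.0, 200.0, 220.0],
})
z = 2.0  # distance threshold (hypothetical units)

# Stand-in for the precomputed n x n distance matrix from the question.
dist = squareform(pdist(df[["Lat", "Lon"]].to_numpy()))

# linkage() expects the condensed (upper-triangle) form, not the square matrix.
condensed = squareform(dist, checks=False)
Z = linkage(condensed, method="complete")
clusters = fcluster(Z, t=z, criterion="distance")

df["AggregatedID"] = ["Aggregate_" + str(c) for c in clusters]

# transform("mean") broadcasts each group mean back onto its rows,
# so no separate merge step is needed.
means = df.groupby("AggregatedID")[["Lat", "Lon", "Elevation"]].transform("mean")
means.columns = ["Lat_mean", "Lon_mean", "Ele_mean"]
df = df.join(means)
```

Even in condensed form, the distances for 60,000 points take roughly 14 GB as float64, so this may still be heavy on a 32 GB machine.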
I have a number of spectra: wavelength/counts at a given temperature. The wavelength range is the same for each spectrum.
I would like to interpolate between temperature and counts to create a large grid of spectra (temperature and counts, over a given wavelength range).
The code below is my current progress. When I try to get a spectrum for a given temperature, I get only a single counts value, whereas I need a range of counts representing the spectrum (I already know the wavelengths).
I think I am confused about arrays and interpolation. What am I doing wrong?
import pandas as pd
import numpy as np
from scipy import interpolate
image_template_one = pd.read_excel("mr_image_one.xlsx")
counts = np.array(image_template_one['counts'])
temp = np.array(image_template_one['temp'])
inter = interpolate.interp1d(temp, counts, kind='linear')
temp_new = np.linspace(30, 50, 41)  # 41 points, i.e. a step of 0.5 (the third argument is a count, not a step)
counts_new = inter(temp_new)
I now think that I have two arrays, [wavelength, counts] and [wavelength, temperature]. Is this correct, and do I need to interpolate between the arrays?
Example data
I think what you want to achieve can be done with interp2d:
import numpy as np
import pandas as pd
from scipy import interpolate

# dummy data
data = pd.DataFrame({
    'temp': [30]*6 + [40]*6 + [50]*6,
    'wave': 3 * [a for a in range(400,460,10)],
    'counts': np.random.uniform(.93,.95,18),
})

# make the interpolator
inter = interpolate.interp2d(data['temp'], data['wave'], data['counts'])

# scipy's interpolators return functions,
# which you need to call with the values you want interpolated.
new_x, new_y = np.linspace(30,50,100), np.linspace(400,450,100)
interpolated_values = inter(new_x, new_y)
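Note that interp2d was deprecated in SciPy 1.10 and removed in 1.14. Here is a minimal sketch of the same idea with RegularGridInterpolator, assuming the spectra share one wavelength grid and can be stacked into a (temperature x wavelength) array. The function name spectrum_at and all the numbers are made up for illustration.

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

temps = np.array([30.0, 40.0, 50.0])    # one measured spectrum per temperature
waves = np.arange(400.0, 460.0, 10.0)   # shared wavelength axis (6 points)
# counts[i, j] = counts at temps[i] and waves[j]; illustrative values only
counts = np.array([t / 100.0 + waves / 1000.0 for t in temps])

inter = RegularGridInterpolator((temps, waves), counts, method="linear")

def spectrum_at(temp):
    """Return the full interpolated spectrum (one value per wavelength)."""
    # One (temperature, wavelength) query point per wavelength
    pts = np.column_stack([np.full(waves.shape, temp), waves])
    return inter(pts)

spec_35 = spectrum_at(35.0)  # array with len(waves) entries
```

Querying one point per wavelength is what turns the single interpolated value from the question into a whole spectrum.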
My data object is an instance of:
class data_instance:
    def __init__(self, data, tlabel):
        self.data = data  # 1xd numpy array
        self.true_label = tlabel  # integer {1,-1}
So far in my code, I have a list called data_history filled with data_instance objects and a set of centers (numpy array with shape (k,d)).
For a given data_instance new_data, I want to:
1/ Get the nearest center to new_data from centers (by Euclidean distance); call it Nearest_center.
2/ Iterate through data_history and:
2.1/ select the elements whose nearest center is Nearest_center (the result of 1/) into a list called neighbors.
2.2/ Get the labels of the objects in neighbors.
Below is my code, which works but is still slow; I am looking for something more efficient.
My Code
For 1/
def getNearestCenter(data,centers):
    if centers.shape != (1,2):
        dist_ = np.sqrt(np.sum(np.power(data-centers,2),axis=1))  # distance between data and all centers
        center = centers[np.argmin(dist_)]  # the center with the minimum distance from data
    else:
        center = centers[0]
    return center
For 2/ (To optimize)
def getLabel(dataPoint, C, history):
    labels = []
    cluster = getNearestCenter(dataPoint.data,C)
    for x in history:
        if np.all(getNearestCenter(x.data,C) == cluster):
            labels.append(x.true_label)
    return labels
You should rather use the optimized cdist from scipy.spatial.distance, which is more efficient than computing the distances with numpy:
from scipy.spatial.distance import cdist
dist = cdist(data, C, metric='euclidean')
dist_idx = np.argmin(dist, axis=1)
An even more elegant solution is to use scipy.spatial.cKDTree (as pointed out by @Saullo Castro in the comments), which could be faster for a large dataset:
from scipy.spatial import cKDTree
tr = cKDTree(C)
dist, dist_idx = tr.query(data, k=1)
Found it:
dist_ = np.argmin(np.sqrt(np.sum(np.power(data[:, None]-C,2),axis=2)),axis=1)
This should return the index of the nearest center in centers from each data point of data.
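To connect this back to step 2/ of the question, here is a minimal sketch that replaces the Python loop with one vectorized comparison. It assumes the history has been stacked once into two arrays, for example history_data = np.vstack([x.data for x in data_history]) and a matching array of labels. The names get_labels, history_data and history_labels are hypothetical, and the sample values are made up.

```python
import numpy as np
from scipy.spatial.distance import cdist

def get_labels(new_point, centers, history_data, history_labels):
    """Labels of all history points whose nearest center matches new_point's."""
    # Index of the nearest center for every history point, shape (n,)
    hist_idx = np.argmin(cdist(history_data, centers), axis=1)
    # Index of the nearest center for the new point
    new_idx = np.argmin(cdist(new_point.reshape(1, -1), centers), axis=1)[0]
    # Boolean mask selects the neighbors in one step
    return history_labels[hist_idx == new_idx]

centers = np.array([[0.0, 0.0], [10.0, 10.0]])
history_data = np.array([[1.0, 0.0], [9.0, 9.0], [0.0, 2.0], [11.0, 10.0]])
history_labels = np.array([1, -1, -1, 1])
labels = get_labels(np.array([0.5, 0.5]), centers, history_data, history_labels)
```

If the centers do not change between queries, hist_idx could be computed once and reused, which would leave only the single-point lookup per new_data.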
I have a pair of 1D arrays (of different lengths) like the following:
data1 = [0,0,0,1,1,1,0,1,0,0,1]
data2 = [0,1,1,0,1,0,0,1]
I would like to get the maximum cross-correlation of the two series in Python. In MATLAB, the xcorr() function returns it correctly.
I have tried the following 2 methods:
numpy.correlate(data1, data2)
signal.fftconvolve(data2, data1[::-1], mode='full')
Both methods give me the same values, but the values I get from Python differ from what comes out of MATLAB. Python gives me integer values > 1, whereas MATLAB gives actual correlation values between 0 and 1.
I have tried normalizing the two arrays first ((value - mean) / SD), but the cross-correlation values I get are in the thousands, which doesn't seem correct.
MATLAB will also give you the lag at which the cross-correlation is greatest. I assume it is easy to do this using indices, but what is the most appropriate way of doing it if my arrays contain tens of thousands of values?
I would like to mimic MATLAB's xcorr() function. Any thoughts on how I would do that in Python?
numpy.correlate(arr1, arr2, "full")
gave me the same output as
xcorr(arr1, arr2)
gives in MATLAB.
Implementation of MATLAB xcorr(x,y) and comparison of the result with an example.
import numpy as np
import matplotlib.pyplot as plt
import scipy.signal as signal

def xcorr(x,y):
    """
    Perform Cross-Correlation on x and y
    x : 1st signal
    y : 2nd signal
    returns
    lags : lags of correlation
    corr : coefficients of correlation
    """
    corr = signal.correlate(x, y, mode="full")
    # correlation_lags is available in SciPy 1.6 and later
    lags = signal.correlation_lags(len(x), len(y), mode="full")
    return lags, corr

n = np.array([i for i in range(0,15)])
x = 0.84**n
y = np.roll(x,5)
lags,c = xcorr(x,y)
plt.figure()
plt.stem(lags,c)
plt.show()
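The values above are unscaled. One common way to get values bounded by 1 in magnitude is to divide by the square root of the product of the two signal energies. Whether this matches the MATLAB output the question refers to is an assumption. A minimal numpy-only sketch, with a made-up test pulse and a hypothetical function name xcorr_normalized:

```python
import numpy as np

def xcorr_normalized(x, y):
    """Cross-correlation scaled by the signal energies, plus the lag axis."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    corr = np.correlate(x, y, mode="full")
    # By the Cauchy-Schwarz inequality this keeps every value in [-1, 1]
    corr /= np.sqrt(np.sum(x**2) * np.sum(y**2))
    # Lag axis matching the ordering of np.correlate's "full" output
    lags = np.arange(-(len(y) - 1), len(x))
    return lags, corr

x = np.array([0, 0, 0, 1, 2, 3, 0, 0, 0, 0])
y = np.roll(x, -3)   # same pulse, three samples earlier
lags, corr = xcorr_normalized(x, y)
best_lag = lags[np.argmax(corr)]   # lag at which the correlation peaks
peak = corr.max()
```

A single argmax over the correlation array gives the lag, and that stays cheap for arrays with tens of thousands of values.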
This code helps find the delay between the channels of an audio file:
import numpy as np
import soundfile as sf  # assumption: sf refers to the soundfile package

xin, fs = sf.read('recording1.wav')
frame_len = int(fs*5*1e-3)
dim_x = xin.shape
M = dim_x[0]  # No. of rows
N = dim_x[1]  # No. of col
sample_lim = frame_len*100
tau = [0]
M_lim = 20000  # for testing as processing takes time
for i in range(1,N):
    c = np.correlate(xin[0:M_lim,0],xin[0:M_lim,i],"full")
    maxlags = M_lim-1
    c = c[M_lim -1 -maxlags: M_lim + maxlags]
    Rmax_pos = np.argmax(c)
    pos = Rmax_pos-M_lim+1
    tau.append(pos)
print(tau)
I'd like to make a set of comparable empirical CDFs for a few numpy arrays (each of different length) and store these in a pandas dataframe:
import numpy as np
from statsmodels.distributions.empirical_distribution import ECDF

a = np.random.randn(100)
b = np.random.randn(500)
# ECDF from statsmodels
cdf_a = ECDF(a)
cdf_b = ECDF(b)
The problem is that cdf_a.x, cdf_a.y will have different lengths from cdf_b.x, cdf_b.y, and I would like them to be the same length, i.e. use the same number of bins to compute each CDF, so that they can be plotted on the same scale from a pandas DataFrame. This is not possible:
df = pandas.DataFrame({"cdf_a": cdf_a.y, "cdf_b": cdf_b.y})
since the CDFs are not the same length. How can I bin a and b using the same bins when computing their CDFs, so that I get comparable same-length vectors back?
Is this the best solution?
bins = np.linspace(0, 1, 10)
v1 = cdf_a(bins)
v2 = cdf_b(bins)
The way we use it in some goodness-of-fit tests is to stack the arrays, so that the CDFs are defined on all points from both arrays.
Then use np.searchsorted to get the ranking: the number of points in dataset 1 below x and the number of points in dataset 2 below x.
If I remember correctly, scipy.stats.ks_2samp does this; have a look at it.
data1 = np.sort(data1)
data2 = np.sort(data2)
n1 = len(data1)
n2 = len(data2)
data_all = np.concatenate([data1,data2])
cdf1 = np.searchsorted(data1,data_all,side='right')/(1.0*n1)
cdf2 = (np.searchsorted(data2,data_all,side='right'))/(1.0*n2)
It appears that this is a good solution:
bins = np.linspace(0, 1, 10)
v1 = cdf_a(bins)
v2 = cdf_b(bins)
Then len(v1) == len(v2) and these can be plotted as CDFs of a, b on the same scale.
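One caveat: bins = np.linspace(0, 1, 10) only covers [0, 1], while standard normal samples mostly fall between about -3 and 3, so a grid built from the data may be a better fit. A minimal numpy-only sketch, assuming a shared grid that spans both samples; the helper name ecdf_on_grid and the grid size of 50 are arbitrary choices.

```python
import numpy as np

def ecdf_on_grid(sample, grid):
    """Fraction of sample values <= each grid point (right-continuous ECDF)."""
    sample = np.sort(np.asarray(sample, dtype=float))
    return np.searchsorted(sample, grid, side="right") / sample.size

rng = np.random.default_rng(0)
a = rng.standard_normal(100)
b = rng.standard_normal(500)

# Shared grid spanning both samples, so both CDFs end at exactly 1
grid = np.linspace(min(a.min(), b.min()), max(a.max(), b.max()), 50)
cdf_a = ecdf_on_grid(a, grid)
cdf_b = ecdf_on_grid(b, grid)
```

Both arrays now have the same length as grid, so pandas.DataFrame({"x": grid, "cdf_a": cdf_a, "cdf_b": cdf_b}) can be built directly.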
I need to compare some theoretical data with real data in python.
The theoretical data comes from resolving an equation.
To improve the comparison, I would like to remove data points that fall far from the theoretical curve. That is, I want to remove the points below and above the red dashed lines in the figure (made with matplotlib).
Both the theoretical curves and the data points are arrays of different length.
I can try to remove the points roughly by eye. For example, the first upper point can be detected using:
data2[(data2.redshift<0.4)&(data2.dmodulus>1)]
rec.array([('1997o', 0.374, 1.0203223485103787, 0.44354759972859786)], dtype=[('SN_name', '|S10'), ('redshift', '<f8'), ('dmodulus', '<f8'), ('dmodulus_error', '<f8')])
But I would like something less rough than picking points by eye.
So, can anyone help me find an easy way of removing the problematic points?
Thank you!
This might be overkill, and is based on your comment: "Both the theoretical curves and the data points are arrays of different length."
I would do the following:
1. Truncate the data set so that its x values lie within the max and min values of the theoretical set.
2. Interpolate the theoretical curve using scipy.interpolate.interp1d and the truncated data x values from above. The reason for step (1) is to satisfy the constraints of interp1d.
3. Use numpy.where to find data y values that are outside the range of acceptable theory values.
4. Don't discard these values, as was suggested in comments and other answers. If you want, for clarity, point them out by plotting the inliers in one color and the outliers in another.
Here's a script that is close to what you are looking for, I think. It hopefully will help you accomplish what you want:
import numpy as np
import scipy.interpolate as interpolate
import matplotlib.pyplot as plt

# make up data
def makeUpData():
    '''Make many more data points (x,y,yerr) than theory (x,y),
    with theory yerr corresponding to a constant "sigma" in y,
    about x,y value'''
    NX = 150
    dataX = (np.random.rand(NX)*1.1)**2
    dataY = (1.5*dataX+np.random.rand(NX)**2)*dataX
    dataErr = np.random.rand(NX)*dataX*1.3
    theoryX = np.arange(0,1,0.1)
    theoryY = theoryX*theoryX*1.5
    theoryErr = 0.5
    return dataX,dataY,dataErr,theoryX,theoryY,theoryErr

def makeSameXrange(theoryX,dataX,dataY):
    '''
    Truncate the dataX and dataY ranges so that dataX min and max are within
    the max and min of theoryX.
    '''
    minT,maxT = theoryX.min(),theoryX.max()
    goodIdxMax = np.where(dataX<maxT)
    goodIdxMin = np.where(dataX[goodIdxMax]>minT)
    return (dataX[goodIdxMax])[goodIdxMin],(dataY[goodIdxMax])[goodIdxMin]

# take 'theory' and get values at every 'data' x point
def theoryYatDataX(theoryX,theoryY,dataX):
    '''For every dataX point, find interpolated theoryY value. theoryX needed
    for interpolation.'''
    f = interpolate.interp1d(theoryX,theoryY)
    return f(dataX[np.where(dataX<np.max(theoryX))])

# collect valid points
def findInlierSet(dataX,dataY,interpTheoryY,theoryErr):
    '''Find where theoryY-theoryErr < dataY < theoryY+theoryErr and return
    the inlier points.'''
    withinUpper = np.where(dataY<(interpTheoryY+theoryErr))
    withinLower = np.where(dataY[withinUpper]
                           >(interpTheoryY[withinUpper]-theoryErr))
    return (dataX[withinUpper])[withinLower],(dataY[withinUpper])[withinLower]

def findOutlierSet(dataX,dataY,interpTheoryY,theoryErr):
    '''Find where dataY > theoryY+theoryErr or dataY < theoryY-theoryErr and
    return the points above and below the band.'''
    aboveUpper = np.where(dataY>(interpTheoryY+theoryErr))
    belowLower = np.where(dataY<(interpTheoryY-theoryErr))
    return (dataX[aboveUpper],dataY[aboveUpper],
            dataX[belowLower],dataY[belowLower])

if __name__ == "__main__":
    dataX,dataY,dataErr,theoryX,theoryY,theoryErr = makeUpData()
    TruncDataX,TruncDataY = makeSameXrange(theoryX,dataX,dataY)
    interpTheoryY = theoryYatDataX(theoryX,theoryY,TruncDataX)
    inDataX,inDataY = findInlierSet(TruncDataX,TruncDataY,interpTheoryY,
                                    theoryErr)
    outUpX,outUpY,outDownX,outDownY = findOutlierSet(TruncDataX,
                                                     TruncDataY,
                                                     interpTheoryY,
                                                     theoryErr)
    fig = plt.figure()
    # top panel: all data with error bars and the theory band
    ax = fig.add_subplot(211)
    ax.errorbar(dataX,dataY,dataErr,fmt='.',color='k')
    ax.plot(theoryX,theoryY,'r-')
    ax.plot(theoryX,theoryY+theoryErr,'r--')
    ax.plot(theoryX,theoryY-theoryErr,'r--')
    ax.set_xlim(0,1.4)
    ax.set_ylim(-.5,3)
    # bottom panel: inliers in black, upper outliers in blue, lower in red
    ax = fig.add_subplot(212)
    ax.plot(inDataX,inDataY,'ko')
    ax.plot(outUpX,outUpY,'bo')
    ax.plot(outDownX,outDownY,'ro')
    ax.plot(theoryX,theoryY,'r-')
    ax.plot(theoryX,theoryY+theoryErr,'r--')
    ax.plot(theoryX,theoryY-theoryErr,'r--')
    ax.set_xlim(0,1.4)
    ax.set_ylim(-.5,3)
    fig.savefig('findInliers.png')
This figure is the result:
In the end I used some of Yann's code:
def theoryYatDataX(theoryX,theoryY,dataX):
    '''For every dataX point, find interpolated theoryY value. theoryX needed
    for interpolation.'''
    f = interpolate.interp1d(theoryX,theoryY)
    return f(dataX[np.where(dataX<np.max(theoryX))])

def findOutlierSet(data,interpTheoryY,theoryErr):
    '''Split data into points inside the band
    theoryY-theoryErr < dmodulus < theoryY+theoryErr and points outside it.'''
    up = np.where(data.dmodulus > (interpTheoryY+theoryErr))
    low = np.where(data.dmodulus < (interpTheoryY-theoryErr))
    # join all the indices together in a flat array
    out = np.hstack([up,low]).ravel()
    index = np.ones(len(data),dtype=bool)
    index[out] = False
    datain = data[index]
    dataout = data[out]
    return datain, dataout

def selectdata(data,theoryX,theoryY):
    """
    Data selection: z<1 and +-0.5 LFLRW separation
    """
    # Select data with redshift z<1
    data1 = data[data.redshift < 1]
    # From modulus to light distance (modulus2distance is my own helper, not shown):
    data1.dmodulus, data1.dmodulus_error = modulus2distance(data1.dmodulus,data1.dmodulus_error)
    # redshift data order
    data1.sort(order='redshift')
    # Outliers: distance to LFLRW curve bigger than +-0.5
    theoryErr = 0.5
    # Theory curve interpolation to get the same points as data
    interpy = theoryYatDataX(theoryX,theoryY,data1.redshift)
    datain, dataout = findOutlierSet(data1,interpy,theoryErr)
    return datain, dataout
Using those functions I can finally obtain:
Thank you all for your help.
Just look at the difference between the red curve and each point; if it is bigger than the difference between the red curve and the dashed red curve, remove the point.
diff = np.abs(points - red_curve)
index = (diff <= (dashed_curve - red_curve))  # True for points inside the band
filtered = points[index]
But please take the comment from NickLH seriously. Your data looks pretty good without any filtering; your "outliers" all have a very big error and won't affect the fit much.
You could either use numpy.where() to identify which xy pairs meet your plotting criteria, or use enumerate to do much the same thing. Example:
x_list = [ 1, 2, 3, 4, 5, 6 ]
y_list = ['f','o','o','b','a','r']
result = [y_list[i] for i, x in enumerate(x_list) if 2 <= x < 5]
print(result)
I'm sure you could change the conditions so that the '2' and '5' in the above example are replaced by the values of your curves.
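For arrays, the same selection can be written as a boolean mask. The sketch below assumes a constant band half-width around the theory curve and uses np.interp to evaluate the theory at the data x positions. All names and values here are made up for illustration.

```python
import numpy as np

# Made-up theory curve and data (illustrative values only)
theory_x = np.linspace(0.0, 1.0, 11)
theory_y = 1.5 * theory_x**2
data_x = np.array([0.10, 0.30, 0.50, 0.70, 0.90])
data_y = np.array([0.02, 0.90, 0.40, 0.10, 1.20])
tol = 0.5  # half-width of the accepted band around the theory curve

# Evaluate the theory curve at the data x positions (linear interpolation)
theory_at_data = np.interp(data_x, theory_x, theory_y)

# Boolean mask: True where the point lies inside the band
inlier = np.abs(data_y - theory_at_data) <= tol
inliers_x, inliers_y = data_x[inlier], data_y[inlier]
outliers_x, outliers_y = data_x[~inlier], data_y[~inlier]
```

Unlike interp1d with its default settings, np.interp does not raise for x values outside the theory range; it holds the end values instead, so truncating the data first is still a sensible precaution.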