I have two different x vs y data sets in Python, where x is wavelength and y is flux. Example:
import numpy as np
wv_arr_1 = np.array([5564.0641521, 5566.43488632, ..., 8401.83301412])
flux_arr_1 = np.array([2.7731672e-15, 2.7822637e-15, ..., 8.0981220e-16])
wv_arr_2 = np.array([5109.3259116, 5111.34467782, ..., 7529.82661321])
flux_arr_2 = np.array([2.6537110e-15, 3.7101513e-15, ..., 2.9433518e-15])
where ... represents many additional numbers in between, and the arrays are not necessarily the same length. I would like to essentially average my two data sets (the flux values), which would be easy if the wavelength scales were exactly the same. But since they're not, I'm unsure of the best way to approach this. I want to end up with one wavelength array and one flux array that encapsulates the average of my two data sets; of course the values can only be averaged at the same (or close enough) wavelengths. What is a Pythonic way to do this?
Your question is a bit open-ended from a scientific point of view. What you want to do only makes complete sense if the two datasets correspond to the same underlying function almost exactly, i.e. if noise is negligible.
Anyway, the first thing you can do is map both of your datasets to a common wavelength array. For this you need to interpolate both sets of data on a 1d grid of wavelengths of your choosing. Again if the data is too noisy then interpolation won't make much sense. But if the datasets are smooth then you can get away even with linear interpolation. Once you have both datasets interpolated onto a common wavelength grid, you can trivially take their average. Note that this will only work if the sampling density is large enough that any larger features in the spectra are well-mapped by both individual datasets.
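For the smooth case, here is a minimal sketch of the interpolate-then-average idea using np.interp (the 1000-point grid over the overlapping wavelength range is an arbitrary choice; pick a density that matches your sampling):
import numpy as np

# Common grid over the region where the two spectra overlap.
lo = max(wv_arr_1.min(), wv_arr_2.min())
hi = min(wv_arr_1.max(), wv_arr_2.max())
common_wv = np.linspace(lo, hi, 1000)

# Linear interpolation of each flux array onto the common grid
# (np.interp requires the wavelength arrays to be increasing).
flux_1_interp = np.interp(common_wv, wv_arr_1, flux_arr_1)
flux_2_interp = np.interp(common_wv, wv_arr_2, flux_arr_2)

# Now the two spectra share a grid and can be averaged directly.
mean_flux = (flux_1_interp + flux_2_interp) / 2.0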
If your data is too noisy, perhaps the only reasonable thing you can do is to take the union of the two datasets and fit a model function (an educated guess) to the joint spectrum. For this you will have to have a very good idea of what your data should look like, and I don't think there's a general-purpose solution that can help you in this case, not without introducing uncontrolled artifacts into your data.
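If you do go that route, here is a rough sketch of pooling the two datasets and fitting a model with scipy.optimize.curve_fit; the straight-line model below is purely a placeholder for whatever functional form your physics suggests:
import numpy as np
from scipy.optimize import curve_fit

# Pool both datasets into one joint spectrum, sorted by wavelength.
wv_all = np.concatenate([wv_arr_1, wv_arr_2])
flux_all = np.concatenate([flux_arr_1, flux_arr_2])
order = np.argsort(wv_all)
wv_all, flux_all = wv_all[order], flux_all[order]

# Placeholder model: a linear continuum. Replace with an educated guess.
def model(wv, a, b):
    return a + b * wv

params, _ = curve_fit(model, wv_all, flux_all)
fitted_flux = model(wv_all, *params)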
I'm looking for some advice on how to implement some statistical models in Python. I'm interested in constructing a sequence of z values (z_1, z_2, z_3, ..., z_n) where the number of jumps in an interval (z_1, z_2] is Poisson-distributed with parameter lambda*(z_2 - z_1),
and the numbers of random jumps over disjoint intervals are independent random variables. I want my piecewise-constant plot to look something like the two images below, where the y axis is Y(z), and Y(z) takes an independent N(0,1) value on each interval.
To construct the z data, what would be the best way to tackle this? I have tried sampling values via np.random.poisson and then taking a cumulative sum, but the values drawn are repeated for small intensity values. Any help or thoughts would be really appreciated. Thanks.
np.random.poisson is used to sample the count of events that occurred in [z_i, z_j). If you want to sample the events as they occur, then you just want the exponential distribution. For example:
import numpy as np

n = 50
# Inter-arrival times of a Poisson process are exponentially distributed;
# their cumulative sum gives the jump locations z_1 < z_2 < ... < z_n.
z = np.cumsum(np.random.exponential(1/n, size=n))
# Independent N(0, 1) level for each interval.
y = np.random.normal(size=n)
Plotting these (using step in matplotlib) gives something similar to your plots.
Note that the 1/n sets the rate ("lambda") so that on average we expect n points within [0, 1]. In this particular run we got slightly fewer, so the sequence overshoots 1 a bit; feel free to rescale if that's important to you.
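For completeness, here is a minimal way to produce that kind of piecewise-constant plot from the z and y arrays above:
import matplotlib.pyplot as plt

# "post" holds each value until the next jump, giving the step look.
plt.step(z, y, where="post")
plt.xlabel("z")
plt.ylabel("Y(z)")
plt.show()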
I've got a multidimensional array that has 1 million sets of 3 points, each point being a coordinate specified by x and y. Calling this array pointVec, what I mean is
np.shape(pointVec) = (1000000,3,2)
I want to find the center of each of the sets of 3 points. One obvious way is to iterate through all 1 million sets, finding the center of each set at each iteration. However, I have heard that vectorization is a strong suit of NumPy's, so I'm trying to apply it to this problem. Since this problem fits so intuitively with iteration, I don't have a grasp of how one might do it with vectorization, or whether using vectorization would even be useful.
It depends on how you define the center of a set of three points. However, if it is the average of the coordinates, as #Quang mentioned in the comments, you can take the average along a specific axis in numpy:
pointVec.mean(1)
This takes the mean along axis=1 (the second axis, the one holding the 3 points) and returns an array of shape (1000000, 2).
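A quick sanity check with random data shaped like the array in the question:
import numpy as np

# Dummy data with the same shape as pointVec from the question.
pointVec = np.random.rand(1000000, 3, 2)
centers = pointVec.mean(axis=1)   # average the 3 points in each set
print(centers.shape)              # (1000000, 2)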
Is it somehow possible to determine the array length of the arrays in the tck tuple returned by scipy.interpolate.splprep before computing the values?
I have to fit a spline interpolation to noisy data with 5 million data points (or less, can be varying).
My observation is that the interpolation at an array length of ~90 is pretty good, whereas computing the interpolation for higher array lengths takes a long time (it sometimes also jumps directly from ~90 to ~1000 when s is made one step smaller, and the interpolation becomes noisy), and it is not accurate enough if the array length is much smaller (<50)...
Actually, this array length depends on the smoothing factor s provided to the splprep function, but for different measurement data, the s needed to get a consistent array length of around 90 varies a lot. E.g. for data1, s needs to be around 1000 to get len(cfk[0]) equal to 90, while for data2, s needs to be around 100 to get len(cfk[0]) equal to 90, even though data1 and data2 have the same length. It might depend on the noise in the data...
I have thought about a loop where s starts at some point and decreases through the loop while len(cfk[0]) is constantly being checked - but this takes ages, especially if len(cfk[0]) gets closer to 90.
Therefore, it would be useful to somehow know the smoothing factor to get the desired array length before computing the cfk tuple.
Short answer: no, not easily. The Dierckx Fortran library, which splrep wraps, uses some fairly non-trivial logic for determining the knot vector, and it's all baked into the Fortran code. So the only way is to carefully trace the latter. The source is available from netlib, and also ships with scipy in scipy/interpolate/fitpack.
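That said, if you want to automate finding an s that yields roughly the desired number of knots, a bisection-style search over s is usually much faster than decreasing s one step at a time. The sketch below is only a heuristic (the function name and bracketing values are made up, and the knot count is neither a continuous nor a strictly monotonic function of s, so it may not converge for every dataset):
import numpy as np
from scipy.interpolate import splprep

def find_s_for_knots(x, y, target=90, s_lo=1e-3, s_hi=1e6, max_iter=30):
    # Bisect (geometrically, since s spans decades) on the smoothing factor
    # until the knot count is close to the target.
    for _ in range(max_iter):
        s_mid = np.sqrt(s_lo * s_hi)
        tck, _ = splprep([x, y], s=s_mid)
        n_knots = len(tck[0])
        if n_knots == target:
            break
        if n_knots > target:
            s_lo = s_mid   # too many knots -> need more smoothing
        else:
            s_hi = s_mid   # too few knots -> need less smoothing
    return s_mid, tck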
I'm having trouble understanding how to begin my solution. I have a matrix with 569 rows, each representing a single sample of my data, and 30 columns representing the features of each sample. My intuition is to plot each individual row, and see what the clusters (if any) look like, but I can't figure out how to do more than 2 rows on a single scatter plot.
I've spent several hours looking through tutorials, but have not been able to understand how to apply them to my data. I know a scatter plot takes 2 vectors as parameters, so how could I possibly plot all 569 samples to cluster them? Am I missing something fundamental here?
import matplotlib.pyplot as plt

# our_data is a 2-dimensional matrix of size 569 x 30
plt.scatter(our_data[0, :], our_data[1, :], s=40)
My goal is to run k-means clustering on the 569 samples.
Since you have a 30-dimensional feature space, it is difficult to plot such data in 2D space (i.e. on a canvas). In such cases one usually applies dimensionality reduction techniques first; this can help with understanding the data structure. You can try applying, e.g., PCA (principal component analysis) first:
# your_matrix.shape == (569, 30)
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pca = PCA(n_components=2)                        # project the 30 features down to 2
projected_data = pca.fit_transform(your_matrix)
# A 2D scatter of the projection is often very helpful for understanding the data structure.
plt.scatter(projected_data[:, 0], projected_data[:, 1])
plt.show()
You can also look at other (including non-linear) dimensionality reduction techniques, such as t-SNE.
Further, you can apply k-means or something else, either to the original data or to the projected data.
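For instance, here is a minimal sketch of k-means on the projected_data from the snippet above (the choice of 3 clusters is arbitrary here):
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(projected_data)      # or fit on your_matrix directly

# Color the PCA projection by cluster label.
plt.scatter(projected_data[:, 0], projected_data[:, 1], c=labels, s=40)
plt.show()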
If by initialize you mean picking the k initial cluster centers, one of the common ways of doing so is k-means++, described here, which was developed to avoid poor clusterings.
It essentially entails choosing the first center completely at random, and then choosing each subsequent center semi-randomly, with probability based on how far each point is from the centers already chosen.
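For illustration, here is a minimal numpy sketch of that initialization (the helper name is made up; in practice sklearn's KMeans already does this for you via its default init='k-means++'):
import numpy as np

def kmeans_pp_init(X, k, rng=None):
    # X has shape (n_samples, n_features); returns k initial centers.
    rng = np.random.default_rng() if rng is None else rng
    centers = [X[rng.integers(len(X))]]          # first center: uniform at random
    for _ in range(k - 1):
        # Squared distance of every point to its nearest chosen center.
        d2 = ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(-1).min(axis=1)
        # Next center is sampled with probability proportional to that distance.
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)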
I am trying to make an algorithm using just numpy (I saw others using PIL, but it has some drawbacks) that can compare and plot the difference between two maps that show ice levels from different years. I load the images and set NaNs to zero, as I have some.
data = np.load(filename)
data[np.isnan(data)]=0
The data arrays contain values between 0 and 100 and represent concentration levels (100 is the deep blue).
The data looks like this:
I am trying to compute the difference so that a loss in ice over time will correspond to a negative value, and a gain in ice will correspond to a positive value. The ice is denoted by the blue color in the plots above.
Any hints? Comparing element by element seems not to be the best idea...
To get the difference between two same-sized numpy arrays of data, just subtract one from the other:
diff = img1 - img2
Numpy is basically a Python wrapper for an underlying C code base designed for exactly these sorts of array operations. Although underneath it is still comparing element to element (as you say above), it does so in compiled code and is therefore significantly faster than looping in Python.
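Putting it together with the NaN handling and the sign convention from the question (the file names here are placeholders for your two years of data):
import numpy as np

ice_old = np.load("ice_year1.npy")   # earlier year
ice_new = np.load("ice_year2.npy")   # later year

# Treat missing values as zero concentration, as in the question.
ice_old = np.nan_to_num(ice_old)
ice_new = np.nan_to_num(ice_new)

# Later minus earlier: ice loss shows up as negative values, gain as positive.
diff = ice_new - ice_old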