Data comparison using numpy - python

I am trying to write an algorithm using just numpy (I saw others using PIL, but it has some drawbacks) that can compare and plot the difference between two maps showing ice levels from different years. I load the images and set the NaNs to zero, since the data contains some:
data = np.load(filename)
data[np.isnan(data)] = 0
The data arrays contain values between 0 and 100 and represent concentration levels (100 is the deep blue).
The data looks like this:
I am trying to compute the difference so that a loss in ice over time will correspond to a negative value, and a gain in ice will correspond to a positive value. The ice is denoted by the blue color in the plots above.
Any hints? Comparing element by element doesn't seem like the best idea...

To get the difference between two same-sized numpy arrays of data, just subtract one from the other:
diff = img1 - img2
Numpy is essentially a Python wrapper around an underlying C code base that is designed for exactly these sorts of operations. Although underneath it is still comparing element to element (as you say above), it does so significantly faster than a pure-Python loop.
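For the ice maps in the question, a minimal sketch might look like the following; the file names are hypothetical, the diverging colormap is an illustrative choice, and the earlier year is subtracted from the later one so that a loss of ice shows up as a negative value:
import numpy as np
import matplotlib.pyplot as plt

# hypothetical file names for the two years
img_old = np.load("ice_1990.npy")
img_new = np.load("ice_2020.npy")
img_old[np.isnan(img_old)] = 0
img_new[np.isnan(img_new)] = 0

diff = img_new - img_old  # negative where ice was lost, positive where it was gained
plt.imshow(diff, cmap="RdBu", vmin=-100, vmax=100)  # diverging colormap centered on zero
plt.colorbar(label="change in concentration")
plt.show()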

Related

Iteratively Optimize SMA Smoothing?

What would be an efficient approach to iterating over simple moving average (SMA) filtering of a modest dataset (<10,000 elements)?
I'm trying to remove vertical tangents and extreme peaks from my dataset while retaining as much resolution as possible. From a process standpoint, my plan was to use SciPy's Simpson's rule integration to compare the area under the original noisy curve to the area under the SMA-smoothed curve. This works well for my process because of the inherent properties of the data. I'm using pandas to calculate the SMA. I'd like to iteratively change the window length (a fixed integer) until the error is minimized, where error = (area under original curve - area under SMA curve)**2.0.
Unfortunately, pandas does not accept an array of windows. To make sure I've hit an acceptable target, I plan to compare the error calculated at each window value and select the one with the smallest error. What would be a code-efficient way to iterate and compare?
This is an example of what I currently have:
import numpy as np
import pandas as pd
import scipy as sci
import scipy.integrate  # needed so that sci.integrate is available

noise_data_x = [1, 1.1, 1, 1.2, 1.3, 1.4, 1.5, 1.6, ..., 100]
noise_data_y = [2.1, 3.4, 3.2, 4.7, ..., 2.1, 5.7]
SMA_data_y = pd.DataFrame(noise_data_y).rolling(window=4).mean()
SMA_data_y_array = []
SMA_data_x_array = []
for i in range(len(SMA_data_y)):
    # drop the NaNs produced at the start of the rolling window
    if not np.isnan(SMA_data_y.iloc[i, 0]):
        SMA_data_x_array.append(noise_data_x[i])
        SMA_data_y_array.append(SMA_data_y.iloc[i, 0])
data_cleaned = sci.integrate.simpson(SMA_data_y_array, x=SMA_data_x_array)
print(data_cleaned)
data_original = sci.integrate.simpson(noise_data_y, x=noise_data_x)
error = (data_cleaned - data_original)**2.0
This code works for a one-off approach, but how would you go about iteratively testing windows, e.g. for window in range(2, 200), with this type of error reduction?
Surely there's a better way than duplicating the array hundreds of times.
I've looked at using a for loop to pass an array of arrays built with np.tile(), but have not had success. Thoughts?
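One possible brute-force sketch of that search, assuming noise_data_x and noise_data_y from above and an example window range of 2-200:
import numpy as np
import pandas as pd
from scipy import integrate

# area under the original noisy curve, computed once
data_original = integrate.simpson(noise_data_y, x=noise_data_x)

errors = {}
for window in range(2, 200):
    sma = pd.Series(noise_data_y).rolling(window=window).mean().dropna()
    x_kept = np.asarray(noise_data_x)[sma.index]  # x values matching the non-NaN SMA points
    data_cleaned = integrate.simpson(sma.to_numpy(), x=x_kept)
    errors[window] = (data_cleaned - data_original) ** 2.0

best_window = min(errors, key=errors.get)  # window with the smallest squared error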

Python, fast computation of rolling percentile

Given a multidimensional array, I want to compute a rolling percentile over one of its axes, with the rolling windows truncated near the boundaries of the array. Below is a minimal example implementation using only numpy via np.nanpercentile() applied to stacked, rolled (through np.roll()) arrays. However, the input array may be very large (~ 1 GB or more), so two issues arise:
For the current implementation, the stacked, rolled array may not fit into RAM. This could be avoided with for-loops over the axes unaffected by the rolling, but that may be slow.
Even fully vectorized (as below), the computation time is quite long, understandably due to the sheer amount of computation performed.
Questions: Is there a more efficient Python implementation of a rolling percentile (with an axis/axes argument or the like, and with truncated windows near the boundaries)? If not, how could the computation be sped up (and, if possible, without exceeding the RAM)? C code called from Python? Computation of percentiles at fewer "central" points, with (e.g. linear) interpolation in between? Other ideas?
Related post (implementing rolling percentiles): How to compute moving (or rolling, if you will) percentile/quantile for a 1d array in numpy? Issues are:
the pandas implementation via pd.Series().rolling().quantile() works only for pd.Series or pd.DataFrame objects, not for multidimensional (4D, or arbitrary-D) arrays;
an implementation via np.lib.stride_tricks.as_strided() with np.nanpercentile() is similar to the one below and should not be much faster, given that np.nanpercentile() is the speed bottleneck (see below).
Minimal example implementation:
import numpy as np
np.random.seed(100)
# random array of numbers
a = np.random.rand(10000,1,70,70)
# size of rolling window
n_window = 150
# percentile to compute
p = 0.7
# NaN values to prepend/append to array before rolling
nan_temp = np.full(tuple([n_window] + list(np.array(a.shape)[1:])), fill_value=np.nan)
# prepend and append NaN values to array
a_temp = np.concatenate((nan_temp, a, nan_temp), axis=0)
# roll array, stack rolled arrays along new dimension, compute percentile (ignoring NaNs) using np.nanpercentile()
res = np.nanpercentile(np.concatenate([np.roll(a_temp, shift=i, axis=0)[...,None] for i in range(-n_window, n_window+1)],axis=-1),p*100,axis=-1)
# cut away the prepended/appended NaN values
res = res[n_window:-n_window]
Computation times (in seconds) for the example case of a having shape (1000,1,70,70) instead of (10000,1,70,70):
create random array: 0.0688176155090332
prepend/append NaN values: 0.03478217124938965
stack rolled arrays: 38.17830514907837
compute nanpercentile: 1145.1418626308441
cut out result: 0.0004646778106689453
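One of the ideas raised in the question, computing the exact truncated-window percentile only at a subset of positions along the rolled axis and interpolating linearly in between, could be sketched roughly as follows; the step size of 50 is an arbitrary assumption, and how well the approximation holds depends on the data:
import numpy as np

a = np.random.rand(1000, 1, 70, 70)
n_window = 150
p = 0.7
step = 50  # compute the exact percentile only every 50th position along axis 0

centers = np.arange(0, a.shape[0], step)
exact = np.empty((centers.size,) + a.shape[1:])
for j, c in enumerate(centers):
    lo, hi = max(0, c - n_window), min(a.shape[0], c + n_window + 1)  # truncated window
    exact[j] = np.percentile(a[lo:hi], p * 100, axis=0)

# linear interpolation between the exactly computed positions, separately per pixel
res_approx = np.empty_like(a)
positions = np.arange(a.shape[0])
for idx in np.ndindex(a.shape[1:]):
    res_approx[(slice(None),) + idx] = np.interp(positions, centers, exact[(slice(None),) + idx])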

Vectorize finding center of sets of points in multidimensional array in Numpy

I've got a multidimensional array that has 1 million sets of 3 points, each point being a coordinate specified by x and y. Calling this array pointVec, what I mean is
np.shape(pointVec) = (1000000,3,2)
I want to find the center of each set of 3 points. One obvious way is to iterate through all 1 million sets, finding the center of each set at each iteration. However, I have heard that vectorization is a strong suit of NumPy, so I'm trying to apply it to this problem. Since this problem fits so intuitively with iteration, I don't have a grasp of how one might do it with vectorization, or whether vectorization would even be useful.
It depends on how you define the center of a set of three points. However, if it is the average of the coordinates, as @Quang mentioned in the comments, you can take the average along a specific axis in numpy:
pointVec.mean(1)
This takes the mean along axis=1 (the second axis, which holds the 3 points of each set) and returns an array of shape (1000000, 2).
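For example, with random data just to illustrate the shapes:
import numpy as np

pointVec = np.random.rand(1000000, 3, 2)  # 1 million sets of 3 (x, y) points
centers = pointVec.mean(axis=1)           # average over the 3 points of each set
print(centers.shape)                      # (1000000, 2)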

Average two arrays with different underlying x-scales in Python

I have two different x vs y data sets in Python, where x is wavelength and y is flux. Example:
import numpy as np
wv_arr_1 = np.array([5564.0641521, 5566.43488632, ..., 8401.83301412])
flux_arr_1 = np.array([2.7731672e-15, 2.7822637e-15, ..., 8.0981220e-16])
wv_arr_2 = np.array([5109.3259116, 5111.34467782, ..., 7529.82661321])
flux_arr_2 = np.array([2.6537110e-15, 3.7101513e-15, ..., 2.9433518e-15])
where ... represents many additional numbers in between, and the arrays might not necessarily be the same lengths. I would like to essentially average my two data sets (the flux values), which would be easy if the wavelength scales were exactly the same. But since they're not, I'm unsure of the best way to approach this. I want to end up with one wavelength array and one flux array that encapsulates the average of my two data sets, but of course the values can only be averaged at the same (or close enough) wavelengths. What is a Pythonic way to do this?
Your question is a bit open-ended from a scientific point of view. What you want to do only makes complete sense if the two datasets should correspond to the same underlying function almost exactly, so that noise is negligible.
Anyway, the first thing you can do is map both of your datasets onto a common wavelength array. For this you need to interpolate both sets of data onto a 1D grid of wavelengths of your choosing. Again, if the data is too noisy then interpolation won't make much sense. But if the datasets are smooth then you can get away with even linear interpolation. Once you have both datasets interpolated onto a common wavelength grid, you can trivially take their average. Note that this will only work if the sampling density is high enough that any larger features in the spectra are well-mapped by both individual datasets.
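As a minimal sketch, assuming the wavelength arrays from the question are sorted in increasing order and using an arbitrary grid spacing of 1 restricted to the overlapping wavelength range:
import numpy as np

# common wavelength grid covering only the overlap of the two datasets
wv_common = np.arange(max(wv_arr_1.min(), wv_arr_2.min()),
                      min(wv_arr_1.max(), wv_arr_2.max()), 1.0)
flux_1_interp = np.interp(wv_common, wv_arr_1, flux_arr_1)
flux_2_interp = np.interp(wv_common, wv_arr_2, flux_arr_2)
flux_avg = 0.5 * (flux_1_interp + flux_2_interp)  # average on the common grid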
If your data is too noisy perhaps the only reasonable thing you can do is to take the union of the datasets, and fit a function from an educated guess onto the joint spectrum. For this you will have to have a very good idea of what your data should look like, but I don't think there's a general-purpose solution that can help you in this case, not without introducing uncontrolled artifacts into your data.

How to get the number of unique areas within a kernel in a connected-component labeled map using Python?

I would like to get the number of areas in a connected-component labeled map.
For example:
[Image value]
00011100022200
00011000002220
00000033300220
44000033000020
44000000000000
And for a 3x3 kernel, I would like to get the number of unique values within each kernel window, stored in a 2D array of the same size.
For example, the new 2D array would have the following values:
newarray[3,3]=2
newarray[2,4]=1
newarray[2,2]=0
I am working with Python. Any ideas?
This could be done with numpy, which has a nice syntax for getting submatrices from a matrix.
Let's assume we have the input data:
101
200
000
You can define this in numpy and retrieve the top-left 2x2-submatrix:
import numpy as np
data = np.array([[1,0,1],[2,0,0],[0,0,0]])
submatrix = data[0:2,0:2]
Your kernel function seems to be "number of unique values not equal to zero in the kernel area". This can be calculated with standard Python functions:
values = list(submatrix.flatten())
without_zeros = list(filter(None, values))     # drop the zeros
unique_values_not_zero = set(without_zeros)    # unique non-zero labels
number_of_areas = len(unique_values_not_zero)  # the value to store for this position
From here, you should be able to continue on your own. Here is a list of the pieces that still need to be implemented:
You need a loop that applies the calculation above to each submatrix of the whole matrix (it will be two nested loops); see the sketch after this list.
You need to deal with the borders. Usually, in image processing, the result matrix is expected to be the same size as the input matrix, so you have to decide how to handle positions where the kernel overlaps areas that are not defined in the matrix anymore, e.g. when you want to set the value for position (0, 0) in your example.
You could create two distinct functions, apply_kernel_to_matrix and unique_values_kernel, so that you can add more kernels later and just reuse the apply_kernel_to_matrix function. Read about lambda functions for this.
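A rough sketch of how those pieces could fit together, assuming the simplest possible border handling (clipping the window at the matrix edges):
import numpy as np

def unique_values_kernel(submatrix):
    # number of unique values in the window, not counting zero
    return len(set(submatrix.flatten()) - {0})

def apply_kernel_to_matrix(data, kernel, size=3):
    half = size // 2
    result = np.zeros(data.shape, dtype=int)
    for i in range(data.shape[0]):
        for j in range(data.shape[1]):
            # clip the window at the borders instead of padding
            sub = data[max(0, i - half):i + half + 1, max(0, j - half):j + half + 1]
            result[i, j] = kernel(sub)
    return result

data = np.array([[1, 0, 1], [2, 0, 0], [0, 0, 0]])
print(apply_kernel_to_matrix(data, unique_values_kernel))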
