I have calculated the boundaries in which I want to sample points.
For example, one dataset looks like:
Now I want to find points in the red area, which I do in the following way:
The plot consists of 10 lines (five regions, each with a lower and an upper boundary), so I reshape to get the region limits per value of x.
limits = data.reshape(data.shape[:-1] + (5, 2))  # shape (n_x, 5, 2): five (min, max) pairs per x value
Now, for a particular value of x, the limits look like:
limits[20] = array([[ 5.65624197, 6.70331962],
[ 13.68248989, 14.77227669],
[ 15.50973796, 16.61491606],
[ 24.03948128, 25.14907398],
[ 26.41541777, 27.53475798]])
I thought to make a mesh and mask the area as follows:
X, Y = np.meshgrid(xs, ys)
bool_array = np.zeros(Y.shape, dtype=bool)
for j, lims in enumerate(limits):      # one set of five (min, max) pairs per x value
    for min_y, max_y in lims:
        inds = np.where(np.logical_and(ys >= min_y, ys <= max_y))[0]
        bool_array[inds, j] = True
plt.imshow(bool_array[::-1])
(I don't know why the graph needs to be plotted inverted)
results in
which is indeed the region I'm looking for; now I can use the True values to pick points with a different function.
The problem is that this code is very slow, and my datasets will get much bigger.
I would like to find a more efficient way of finding this "mask".
I tried several things and ended up with the following, which worked for my simple cases:
low_bound = limits[:,:,0]
upp_bound = limits[:,:,1]
mask = np.any((low_bound[:, None, :] <= Y.T[:, :, None]) &
              (Y.T[:, :, None] <= upp_bound[:, None, :]), axis=-1).T
I know it looks ugly. What I do is introduce an additional dimension along which I check whether each y value lies between the two endpoints of each region. At the end I collapse that additional dimension with np.any.
I don't know how much faster it is compared to your code. However, since it avoids Python-level for loops, there should be a performance boost.
Check the code with your data and tell me if something goes wrong.
Edit:
plt.imshow places (0, 0) in the lower-left corner when you use
plt.imshow(mask, origin='lower')
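For reference, here is a minimal self-contained sketch of the same broadcasting idea with made-up limits (the array values and sizes below are placeholders, not the original data):
import numpy as np
import matplotlib.pyplot as plt

# Made-up limits: two x positions, each with two (min, max) bands.
limits = np.array([[[1.0, 2.0], [4.0, 5.0]],
                   [[0.5, 1.5], [3.0, 6.0]]])   # shape (n_x, n_bands, 2)
xs = np.arange(limits.shape[0])
ys = np.linspace(0.0, 7.0, 8)
X, Y = np.meshgrid(xs, ys)                      # Y has shape (n_y, n_x)

low_bound = limits[:, :, 0]                     # (n_x, n_bands)
upp_bound = limits[:, :, 1]

# Broadcast: for every (x, y) check membership in each band, then collapse the bands.
mask = np.any((low_bound[:, None, :] <= Y.T[:, :, None]) &
              (Y.T[:, :, None] <= upp_bound[:, None, :]), axis=-1).T

plt.imshow(mask, origin='lower')                # True where y lies inside a band for that x
plt.show()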
I am a python newbie, trying to understand how to work with numpy masks better.
I have two 2D data arrays plus axis values, so something like
import numpy as np
data1=np.arange(50).reshape(10,5)
data2=np.random.rand(10,5)
x=5*np.arange(5)+15
y=2*np.arange(10)
Where x contains the coordinates along the 2nd axis (columns) of data1 and data2, and y gives the coordinates along the 1st axis (rows).
I want to identify and count all the points in data1 for which
data1>D1min,
the corresponding x values are inside a given range, XRange, and
the corresponding y values are inside a given range, YRange.
Then, when I am all done, I also need to do a check to make sure none of the corresponding data2 values are less than another limit, D2Max
so if
XRange = [27,38]
YRange = [2,12]
D1min = 23
D2Max = 0.8
I would want to include cells 3:4 in the x direction and 1:6 in the y direction (assuming I want to include the limiting values).
Since y runs along the first axis of data1 and x along the second, that means I would only consider data1[1:7, 3:5].
Then the limits of the values in the 2D arrays come into it, so I want to identify and count points for which data1[1:7, 3:5] > 23.
Once I have done that I want to take those data locations and check to see if any of those locations have values <0.8 in data2.
In reality I don't have formulas for x and y, and the arrays are much larger. Also, x and y might not even be monotonic.
I figure I should use numpy masks for this, and I have managed to do it, but the result seems really tortured - I think the code would be clearer if I just looped through the values in the 2D arrays.
I think the main problem is that I have trouble combining masks with boolean operations. The ideas I get from searching online often don't seem to work on arrays.
I assume there is an elegant and (hopefully) understandable way to do this in just a few lines with masks. Would anyone care to explain it to me?
Well I eventually came up with something, so I thought I'd post it. I welcome suggested improvements.
#expand x and y into 2D arrays so that they can more
#easily be used for masking using tile
x2D = np.tile(x,(len(y),1))
y2D = np.tile(y,(len(x),1)).T
#mask these based on the ranges in X and Y
Xmask = np.ma.masked_outside(x2D,XRange[0],XRange[1]).mask
Ymask = np.ma.masked_outside(y2D,YRange[0],YRange[1]).mask
#then combine them
#Not sure I need the shrink=False, but it seems safer
XYmask = np.ma.mask_or(Xmask, Ymask,shrink=False)
#now mask the data1 array based on D1min.
highdat = np.ma.masked_less(data1,D1min)
#combine with XYmask
data1mask = np.ma.mask_or(highdat.mask, XYmask,shrink=False)
#apply to data1
data1masked = np.ma.masked_where(data1mask,data1)
#number of points fulfilling my criteria
print('Number of points: ',np.ma.count(data1masked))
#transfer mask from data1 to data2
data2masked = np.ma.masked_where(data1mask, data2)
#do my check based on data2
if data2masked.min() < D2Max: print('data2 values are low!')
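For comparison, here is a plain boolean-array version of the same logic (my own sketch, not part of the original answer). Note that it uses the strict data1 > D1min from the question, whereas masked_less above also keeps values equal to D1min:
import numpy as np

data1 = np.arange(50).reshape(10, 5)
data2 = np.random.rand(10, 5)
x = 5 * np.arange(5) + 15
y = 2 * np.arange(10)
XRange, YRange, D1min, D2Max = [27, 38], [2, 12], 23, 0.8

# x varies along the columns of data1, y along the rows (as in the tile-based code above).
in_x = (x >= XRange[0]) & (x <= XRange[1])      # shape (5,)
in_y = (y >= YRange[0]) & (y <= YRange[1])      # shape (10,)
keep = in_y[:, None] & in_x[None, :] & (data1 > D1min)

print('Number of points: ', keep.sum())
if (data2[keep] < D2Max).any():
    print('data2 values are low!')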
This question has a lot of useful answers on how to get a moving average.
I have tried the two methods of numpy convolution and numpy cumsum and both worked fine on an example dataset, but produced a shorter array on my real data.
The data are spaced by 0.01. The example dataset has a length of 50, the real data tens of thousands. So it must be something about the window size that is causing the problem and I don't quite understand what is going on in the functions.
This is how I define the functions:
def smoothMAcum(depth, temp, scale):  # Moving average by cumsum, scale = window size in m
    dz = np.diff(depth)
    N = int(scale / dz[0])
    cumsum = np.cumsum(np.insert(temp, 0, 0))
    smoothed = (cumsum[N:] - cumsum[:-N]) / N
    return smoothed

def smoothMAconv(depth, temp, scale):  # Moving average by numpy convolution
    dz = np.diff(depth)
    N = int(scale / dz[0])
    smoothed = np.convolve(temp, np.ones((N,)) / N, mode='valid')
    return smoothed
Then I implement it:
scale = 5.
smooth = smoothMAconv(dep,data, scale)
but print len(dep), len(smooth)
returns 81071 80572
and the same happens if I use the other function.
How can I get the smooth array of the same length as the data?
And why did it work on the small dataset? Even if I try different scales (and use the same for the example and for the data), the result in the example has the same length as the original data, but not in the real application.
I considered an effect of nan values, but if I have a nan in the example, it doesn't make a difference.
So where is the problem, if possible to tell without the full dataset?
The second of your approaches is easy to modify to preserve the length, because numpy.convolve supports the parameter mode='same'.
np.convolve(temp, np.ones((N,))/N, mode='same')
This is made possible by zero-padding the data set temp on both sides, which will inevitably have some effect at the boundaries unless your data happens to be 0 near the boundaries. Example:
import numpy as np
import matplotlib.pyplot as plt

N = 10
x = np.linspace(0, 2, 100)
y = x**2 + np.random.uniform(size=x.shape)
y_smooth = np.convolve(y, np.ones((N,))/N, mode='same')

plt.plot(x, y, 'r.')
plt.plot(x, y_smooth)
plt.show()
The boundary effect of zero-padding is very visible at the right end, where the data points are about 4-5 but are padded by 0.
To reduce this undesired effect, use numpy.pad for more intelligent padding and revert to mode='valid' for the convolution. The pad width must be such that in total N-1 elements are added, where N is the size of the moving window.
y_padded = np.pad(y, (N//2, N-1-N//2), mode='edge')
y_smooth = np.convolve(y_padded, np.ones((N,))/N, mode='valid')
Padding by edge values of an array looks much better.
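If you want to keep the interface from the question, a length-preserving variant of smoothMAconv could look like the sketch below (same assumption as in the question that the depth spacing is uniform, so N is taken from the first spacing; the function name is mine):
import numpy as np

def smoothMAconv_same(depth, temp, scale):
    """Moving average by convolution that keeps the original length."""
    dz = np.diff(depth)
    N = int(scale / dz[0])                          # window size in samples
    # Pad with edge values so the 'valid' convolution returns len(temp) points.
    temp_padded = np.pad(temp, (N // 2, N - 1 - N // 2), mode='edge')
    return np.convolve(temp_padded, np.ones(N) / N, mode='valid')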
I have a function (f: black line) which varies sharply in a specific, small region (derivative f': blue line, and second derivative f'': red line). I would like to integrate this function numerically, and if I distribute points evenly (in log space) I end up with fairly large errors in the sharply varying region (near 2E15 in the plot).
How can I construct an array spacing such that it is very well sampled in the area where the second derivative is large (i.e. a sampling frequency proportional to the second derivative)?
I happen to be using python, but I'm interested in a general algorithm.
Edit:
1) It would be nice to be able to still control the number of sampling points (at least roughly).
2) I've considered constructing a probability distribution function shaped like the second derivative and drawing randomly from that --- but I think this will offer poor convergence, and in general, it seems like a more deterministic approach should be feasible.
Assuming the values of f'' are stored in a NumPy array (call it d2f here), you could do the following:
# Scale these deltas as you see fit
deltas = 1 / d2f
domain = deltas.cumsum()
To account only for order-of-magnitude swings, this could be adjusted as follows:
deltas = 1 / (-np.log10(1 / d2f))
I'm just spitballing here ... (as I don't have time to try this out for real)...
Your data looks (roughly) linear on a log-log plot (at least, each segment seems to be), so I might consider doing a sort of integration in log space.
log_x = np.log(x)
log_y = np.log(y)
Now, for each of your points, you can get the slope (and intercept) in log-log space:
rise = np.diff(log_y)
run = np.diff(log_x)
slopes = rise / run
And, similarly, the intercept can be calculated:
# y = mx + b
# :. b = y - mx
intercepts = log_y[:-1] - slopes * log_x[:-1]
Alright, now we have a bunch of (straight) lines in log-log space. But a straight line in log-log space corresponds to y = exp(intercept) * x^slope in real space. We can integrate that easily enough: the antiderivative of a*x^k is a/(k+1) * x^(k+1), so...
def _eval_log_log_integrate(a, k, x):
    # a is the intercept in log space, so exp(a) is the prefactor in real space
    return np.exp(a) / (k + 1) * x ** (k + 1)

def log_log_integrate(a, k, x1, x2):
    return _eval_log_log_integrate(a, k, x2) - _eval_log_log_integrate(a, k, x1)
partial_integrals = []
for a, k, x_lower, x_upper in zip(intercepts, slopes, x[:-1], x[1:]):
    partial_integrals.append(log_log_integrate(a, k, x_lower, x_upper))

total_integral = sum(partial_integrals)
You'll want to check my math -- It's been a while since I've done this sort of thing :-)
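Since the answer invites checking the math, here is a small self-contained sanity check on a pure power law (my own example, not from the original post). With y = 3*x**2 the exact integral from 1 to 100 is 100**3 - 1:
import numpy as np

x = np.logspace(0, 2, 200)                      # 1 .. 100
y = 3 * x**2

log_x, log_y = np.log(x), np.log(y)
slopes = np.diff(log_y) / np.diff(log_x)        # all ~2
intercepts = log_y[:-1] - slopes * log_x[:-1]   # all ~log(3)

def F(a, k, x):
    # antiderivative of exp(a) * x**k
    return np.exp(a) / (k + 1) * x ** (k + 1)

total = np.sum(F(intercepts, slopes, x[1:]) - F(intercepts, slopes, x[:-1]))
print(total, 100**3 - 1)                        # both approximately 999999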
1) The Cool Approach
At the moment I implemented an 'adaptive refinement' approach inspired by hydrodynamics techniques. I have a function which I want to sample, f, and I choose some initial array of sample points x_i. I construct a "sampling" function g, which determines where to insert new sample points.
In this case I chose g as the slope of log(f) --- since I want to resolve rapid changes in log space. I then divide the span of g into L=3 refinement levels. If g(x_i) exceeds a refinement level, that span is subdivided into N=2 pieces, those subdivisions are added into the samples and are checked against the next level. This yields something like this:
The solid grey line is the function I want to sample, and the black crosses are my initial sampling points.
The dashed grey line is the derivative of the log of my function.
The colored dashed lines are my 'refinement levels'
The colored crosses are my refined sampling points.
This is all shown in log-space.
2) The Simple Approach
After I finished (1), I realized that I probably could have just chosen a maximum spacing in y and picked x-spacings to achieve that. In other words, just divide the function evenly in y and find the corresponding x points. The results of this are shown below (a rough code sketch of the idea follows as well):
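A rough code sketch of that simple approach, assuming the function is monotonic over the region of interest so the y-grid can be inverted by interpolation (the coarse grid and the stand-in function below are placeholders):
import numpy as np

# Coarse reference sampling of the (monotonic) function of interest.
x_coarse = np.logspace(13, 18, 1000)
y_coarse = np.log10(x_coarse)                   # stand-in for log(f); use your own f here

# Choose equally spaced values in y, then invert back to x by interpolation.
n_points = 200
y_levels = np.linspace(y_coarse.min(), y_coarse.max(), n_points)
x_samples = np.interp(y_levels, y_coarse, x_coarse)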
A simple approach would be to split the x-axis array into three parts and use different spacing for each of them. This would allow you to maintain the total number of points and also the required spacing in different regions of the plot. For example:
x = np.linspace(10**13, 10**15, 100)
x = np.append(x, np.linspace(10**15, 10**16, 100))
x = np.append(x, np.linspace(10**16, 10**18, 100))
You may want to choose a better spacing based on your data, but you get the idea.
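If the quantity is naturally viewed in log space, np.geomspace gives logarithmic spacing within each segment; a variation on the snippet above (the per-segment point counts are arbitrary):
import numpy as np

# Denser sampling in the middle segment, coarser outside.
x = np.concatenate([
    np.geomspace(1e13, 1e15, 50, endpoint=False),
    np.geomspace(1e15, 1e16, 200, endpoint=False),
    np.geomspace(1e16, 1e18, 50),
])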
This might seem a bit strange, but I really feel like there should be a relatively straightforward solution to it. Basically I've got an image in the form of a 3D numpy array (x, y, color). I was following along with this tutorial for a slightly different product area, and found that these methods did not extend well.
As a result, I'm making a modified edge detection algorithm for my use case. As of now this is just some basic signal processing on top of a 1d array. This works great if I only want to sample in the x and y directions, as I can just use the existing rows and columns of the array.
However, to determine the orientation of these edges, I would like to be able to sample along any arbitrary vector across the image. Below is an image to help illustrate:
I tried hacking together something that would just append pixels as it crossed them, but it was inefficient, inelegant, and non-ideal in a number of ways. I feel like there must be some relatively elegant way of doing this.
Any ideas? The size of the sample across the vector doesn't really matter to me if that makes things any easier.
I would make an equation for the line you want to cut along, then make a mask around it and keep all pixels that come within some width of it. For example, say you want a cut along i = 2*j + 34, where i and j are measured in pixels:
h, w = im.shape[:2]
width = 2 # width of slice in pixels, too narrow and it will have gaps
i, j = np.ogrid[:h, :w]
mask = np.abs(2*j + 34 - i) < width
im[mask]
Note that im[mask] will be a 2d array, since it should still have the colors. It will be ordered so that the uppermost pixels are first, and the bottom pixels are last, opposite of that shown in your arrow, unless of course you have origin=lower in your plotting :) And if several pixels are selected in each row (if width > 1), then they'll go left to right, so the shape for a slice like your drawing would be a tiny sequence of z's, and for the other direction, backwards z's (s's?).
Keep in mind that for an array there doesn't exist a diagonal slice without some weird zigzag (or alternatively, interpolation) no matter how elegant your implementation is. You could rotate the image (by some algorithm) and take a horizontal slice.
Using the equations
x2 = x1 + length * cos(θ)
y2 = y1 + length * sin(θ)
where
θ = angle * 3.14 / 180.0
You can iterate through the pixels using angle and length like this:
int angle = 45;        // angle of iteration
int length = 0;        // starting length; a value other than 0 skips the first pixels
Point P1(startX, startY);   // your starting point
Point P2;

while (1) {
    length++;
    P2.x = (int)round(P1.x + length * cos(angle * CV_PI / 180.0));
    P2.y = (int)round(P1.y + length * sin(angle * CV_PI / 180.0));
    if (P2_exceed_boundary()) break;
    do_Whatever_with_P2();
}
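For anyone who wants to stay in Python, the same idea can be written with NumPy and scipy.ndimage.map_coordinates, which does the interpolation along the sampled line (a sketch; the helper name and defaults are my own):
import numpy as np
from scipy.ndimage import map_coordinates

def sample_line(im, p1, p2, num=None):
    """Sample image values along the segment p1 -> p2, given in (row, col) coordinates."""
    (r1, c1), (r2, c2) = p1, p2
    if num is None:
        num = int(np.hypot(r2 - r1, c2 - c1)) + 1   # roughly one sample per pixel
    rows = np.linspace(r1, r2, num)
    cols = np.linspace(c1, c2, num)
    # Interpolate each color channel separately, then stack to shape (num, n_channels).
    return np.stack([map_coordinates(im[..., ch], [rows, cols], order=1)
                     for ch in range(im.shape[-1])], axis=-1)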
So this is a little follow-up question to my earlier question: Generate coordinates inside Polygon and my answer https://stackoverflow.com/a/15243767/1740928
In fact, I want to bin polygon data to a regular grid. Therefore, I calculate a couple of coordinates within the polygon and translate their lat/lon combination to their respective column/row combo of the grid.
Currently, the row/column information is stored in a numpy array with its number of rows corresponding to the number of data polygons and its number of columns corresponding to the coordinates in the polygon.
The rest of the code takes less than a second, but this part is the bottleneck at the moment (~7 s):
for ii in np.arange(len(data)):
    for cc in np.arange(data_lats.shape[1]):
        final_grid[row[ii, cc], col[ii, cc]] += data[ii]
        final_grid_counts[row[ii, cc], col[ii, cc]] += 1
The array "data" simply contains the data values for each polygon (80000,). The arrays "row" and "col" contain the row and column number of a coordinate in the polygon (shape: (80000,16)).
As you can see, I am summing up all data values within each grid cell and count the number of matches. Thus, I know the average for each grid cell in case different polygons intersect it.
Still, how can these two for loops take around 7 seconds? Can you think of a faster way?
I think numpy should add an nd-bincount function; I had one lying around from a project I was working on some time ago.
import numpy as np

def two_d_bincount(row, col, weights=None, shape=None):
    if shape is None:
        shape = (row.max() + 1, col.max() + 1)

    row = np.asarray(row, 'int')
    col = np.asarray(col, 'int')
    x = np.ravel_multi_index([row, col], shape)
    out = np.bincount(x, weights, minlength=np.prod(shape))
    return out.reshape(shape)

weights = np.column_stack([data] * row.shape[1])
final_grid = two_d_bincount(row.ravel(), col.ravel(), weights.ravel())
final_grid_counts = two_d_bincount(row.ravel(), col.ravel())
I hope this helps.
I might not fully understand the shapes of your different grids, but you can maybe eliminate the cc loop using something like this:
final_grid = np.zeros((nrows, ncols))
for ii in xrange(len(data)):
    # note: fancy-index += adds each cell only once per row of indices,
    # even if the same (row, col) pair appears more than once for a polygon
    final_grid[row[ii, :], col[ii, :]] += data[ii]
This of course assumes that final_grid starts out at zero (so the sums you are accumulating start from nothing). And I'm not sure how to test whether it works without understanding how your row and col arrays work.
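For completeness, np.add.at performs unbuffered in-place addition and handles repeated (row, col) pairs correctly, so the whole accumulation can be done without a Python loop (a sketch using the array shapes from the question; nrows and ncols as above):
import numpy as np

# row, col: (80000, 16) integer index arrays; data: (80000,) polygon values.
final_grid = np.zeros((nrows, ncols))
final_grid_counts = np.zeros((nrows, ncols))

# Repeat each data value once per coordinate so it lines up with row.ravel() / col.ravel().
weights = np.repeat(data, row.shape[1])

np.add.at(final_grid, (row.ravel(), col.ravel()), weights)
np.add.at(final_grid_counts, (row.ravel(), col.ravel()), 1)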