I have this distribution of points (allPoints, which is a list of lists: [[x1,y1],[x2,y2],[x3,y3],[x4,y4],...,[xn,yn]]):
From which I'd like to select points, randomly.
in Python I would do something like:
from random import *
point = choice(allPoints)
Except, I need the random pick not to be biased by the existing density. For instance, here, "choice" would tend to pick a point from the dense upper-left part of the plot.
How can I, in Python, get rid of this bias?
I've tried dividing the space into div × div portions and then sampling within a randomly chosen portion, but in many cases no point falls inside it and the while loop never finds a solution:
def column(matrix, i):
    return [row[i] for row in matrix]

div = 10

min_x, max_x = min(column(allPoints, 0)), max(column(allPoints, 0))
min_y, max_y = min(column(allPoints, 1)), max(column(allPoints, 1))

zone_x_min = randint(1, div - 1) * (max_x - min_x) / div + min_x
zone_x_max = zone_x_min + (max_x - min_x) / div
zone_y_min = randint(1, div - 1) * (max_y - min_y) / div + min_y
zone_y_max = zone_y_min + (max_y - min_y) / div

p = choice(allPoints)
cont = True
while cont:
    if zone_x_min < p[0] < zone_x_max and zone_y_min < p[1] < zone_y_max:
        cont = False
    else:
        p = choice(allPoints)
What would be a correct and, if possible, inexpensive solution to this problem?
If it weren't ridiculous, I think something like this would work for me, in theory:
p = [uniform(min_x, max_x), uniform(min_y, max_y)]
while p not in allPoints:
    p = [uniform(min_x, max_x), uniform(min_y, max_y)]
The question is a little ill-formed, but here's a stab.
The idea is to use a Gaussian kernel density estimate, then sample from your data with weights equal to the inverse of the pdf at each point.
This is not statistically justifiable in any real sense.
import numpy as np
from scipy import stats

# random data
x = np.random.normal(size=200)
y = np.random.normal(size=200)

# estimate the density
kernel = stats.gaussian_kde(np.vstack([x, y]))

# calculate the inverse of the pdf for each point, and normalise to sum to 1
inv_pdf = 1 / kernel.pdf(np.vstack([x, y]))
pvector = inv_pdf / inv_pdf.sum()

# get a vector of indices based on your weights
np.random.choice(range(len(x)), size=10, replace=True, p=pvector)
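A small usage note (assuming the same x, y and pvector as above): the returned indices can be mapped back to coordinates like this:
import numpy as np

idx = np.random.choice(len(x), size=10, replace=True, p=pvector)
sampled_points = np.column_stack([x, y])[idx]   # shape (10, 2)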
I believe you want to randomly select a datum point from your graph. That is, one of the little black dots.
Compute a centroid, or pick a reference point like (1.0, 70).
Compute the distance from each point to the centroid and let that be proportional to the probability of choosing that point.
That is, if distance(P, C) is 100 and distance(Q, C) is 1, then P is 100x more likely to be chosen. All points are eligible to win, but the crowded ones are individually less likely (though they make up for it in volume).
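A minimal sketch of this distance-weighted pick, assuming allPoints from the question; NumPy is used only for convenience and the names below are illustrative:
import numpy as np

pts = np.asarray(allPoints)              # shape (n, 2)
centroid = pts.mean(axis=0)              # or any fixed reference point
dist = np.linalg.norm(pts - centroid, axis=1)
weights = dist / dist.sum()              # farther points are proportionally more likely
picked = pts[np.random.choice(len(pts), p=weights)]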
If I understand your initial attempt correctly, I believe there is a simple adjustment you can make to make this work.
Randomly generate an x value (0,4.5), and a y value (0,70).
Then loop through allPoints to find the closest dot.
This has the downside that large empty areas all converge to a single dot. A way to mitigate (not remove) this problem is to give your random point a maximum search range: if no dot exists within that range, generate a new random point.
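A rough sketch of that idea, using scipy.spatial.cKDTree for the nearest-dot lookup (the (0, 4.5) and (0, 70) bounds are the ones mentioned above, and max_range is the optional rejection radius):
import numpy as np
from scipy.spatial import cKDTree

pts = np.asarray(allPoints)
tree = cKDTree(pts)

def pick_spatially_uniform(max_range=np.inf):
    while True:
        # draw a uniform location in the bounding box, then snap to the closest dot
        target = np.array([np.random.uniform(0, 4.5), np.random.uniform(0, 70)])
        d, idx = tree.query(target)
        if d <= max_range:   # optional rejection if the nearest dot is too far away
            return pts[idx]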
Assuming you want your selected points to be visually spread out, I can think of at least one "efficient/easy" method (sketched in the code below):
Choose a random point (with random.choice for example) ;
remove from your initial set any point that is "close"*;
repeat until there is no point left in your set.
*This requires that you know from the beginning how dense you want your sample to be.
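A minimal sketch of this thinning, assuming allPoints as in the question and using cKDTree only to make the "close" test cheap (min_dist is the density knob from the footnote):
import numpy as np
from scipy.spatial import cKDTree

def thin_points(all_points, min_dist):
    pts = np.asarray(all_points)
    order = np.arange(len(pts))
    np.random.shuffle(order)                 # visiting in random order == picking randomly
    tree = cKDTree(pts)
    alive = np.ones(len(pts), dtype=bool)
    kept = []
    for i in order:
        if not alive[i]:
            continue
        kept.append(pts[i])
        # discard everything closer than min_dist to the point we just kept
        for j in tree.query_ball_point(pts[i], min_dist):
            alive[j] = False
    return np.asarray(kept)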
I have some random test data in a 2D array of shape (500,2) as such:
xy = np.random.randint(low=0.1, high=1000, size=[500, 2])
From this array, I first select 10 random samples. To select the 11th sample, I want to pick the sample that is furthest away from the 10 already-selected samples collectively; I am using the Euclidean distance to do this. I need to keep doing this until a certain number have been picked. Here is my attempt at doing this.
# Function to get the distance between samples
def get_dist(a, b):
    return np.sqrt(np.sum(np.square(a - b)))

# Set up variables and empty lists for the selected samples and starting samples
n_xy_to_select = 120
selected_xy = []
starting = []

# This selects 10 random samples and appends them to selected_xy
for i in range(10):
    idx = np.random.randint(len(xy))
    starting_10 = xy[idx, :]
    selected_xy.append(starting_10)
    starting.append(starting_10)
    xy = np.delete(xy, idx, axis=0)
starting = np.asarray(starting)
# This performs the selection based on the distances
for i in range(n_xy_to_select - 1):
    # Set up an empty array dists
    dists = np.zeros(len(xy))
    for selected_xy_ in selected_xy:
        # Get the distance between each already selected sample and every unselected sample
        dists_ = np.array([get_dist(selected_xy_, xy_) for xy_ in xy])
        # Apply some kind of penalty function - this is the key
        dists_[dists_ < 90] -= 25000
        # Sum dists_ onto dists
        dists += dists_
    # Select the largest one
    dist_max_idx = np.argmax(dists)
    selected_xy.append(xy[dist_max_idx])
    xy = np.delete(xy, dist_max_idx, axis=0)
The key to this is this line - the penalty function
dists_[dists_ < 90] -= 25000
This penalty function exists to prevent the code from just picking a ring of samples at the edge of the space, by artificially shortening values that are close together.
However, this eventually breaks down, and the selection starts clustering, as shown in the image. You can clearly see that there are much better selections that the code can make before any kind of clustering is necessary. I feel that a kind of decaying exponential function would be best for this, but I do not know how to implement it.
So my question is; how would I change the current penalty function to get what I'm looking for?
From your question, I understand that what you are looking for are Periodic Boundary Conditions (PBC), meaning that a point at the left edge of your space is right next to a point at the right edge. As a consequence, the maximal distance you can get along one axis is half the box size (i.e. between the edge and the center).
To take the PBC into account, you compute the absolute difference along each axis and, whenever it exceeds half the box, replace it by the box size minus that difference:
For example, if you have a point with x1 = 100 and a second one with x2 = 900 in a box of size 1000, under the PBC they are 200 units apart: 1000 - |x1 - x2|. In the general case, given two coordinates and the box size L, the per-axis difference is delta = |x1 - x2|, and if delta > L/2 it becomes L - delta.
In your case (box size 1000) this becomes:
delta_x[delta_x > 500] = 1000 - delta_x[delta_x > 500]
To wrap it up, I rewrote your code using a new distance function (note that I removed some unnecessary for loops):
import numpy as np

def distance(p, arr, box=1000):
    delta_x = np.abs(p[0] - arr[:, 0])
    delta_y = np.abs(p[1] - arr[:, 1])
    # minimum-image convention: wrap differences larger than half the box
    delta_x[delta_x > box / 2] = box - delta_x[delta_x > box / 2]
    delta_y[delta_y > box / 2] = box - delta_y[delta_y > box / 2]
    return np.sqrt(delta_x**2 + delta_y**2)
xy = np.random.randint(low=0.1, high=1000, size=[500, 2])
idx = np.random.randint(500, size=10)
selected_xy = list(xy[idx])
_initial_selected = xy[idx]
xy = np.delete(xy, idx, axis = 0)
n_xy_to_select = 120

for i in range(n_xy_to_select - 1):
    # Set up an empty array dists
    dists = np.zeros(len(xy))
    for selected_xy_ in selected_xy:
        # Compute the distance taking the PBC into account
        dists_ = distance(selected_xy_, xy)
        dists += dists_
    # Select the largest one
    dist_max_idx = np.argmax(dists)
    selected_xy.append(xy[dist_max_idx])
    xy = np.delete(xy, dist_max_idx, axis=0)
And indeed it creates clusters; this is normal, as you will tend to create clusters of points that sit at the maximal distance from each other. More than that, due to the boundary conditions, the maximal distance between two points along one axis is 500. The maximal distance between two clusters is thus also 500, and as you can see in the image, that is the case.
Moreover, picking more points will start to draw lines connecting the different clusters, starting from the central one, as you can see here:
What I was looking for is called 'Furthest Point Sampling'. I have done some more research into the solution, and the Python code used to perform this is found here: https://minibatchai.com/ai/2021/08/07/FPS.html
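For reference, here is a minimal greedy farthest-point-sampling sketch (this is not the code from the link, just the idea; xy is assumed to be an (n, 2) array and the first sample is chosen at random):
import numpy as np

def farthest_point_sampling(xy, n_samples):
    selected = [np.random.randint(len(xy))]
    # distance from every point to its nearest already-selected point
    d_min = np.linalg.norm(xy - xy[selected[0]], axis=1)
    for _ in range(n_samples - 1):
        nxt = int(np.argmax(d_min))              # farthest from the current selection
        selected.append(nxt)
        d_min = np.minimum(d_min, np.linalg.norm(xy - xy[nxt], axis=1))
    return xy[selected]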
I am trying to pack hard-spheres in a unit cubical box, such that these spheres cannot overlap on each other. This is being done in Python.
I am given some packing fraction f, and the number of spheres in the system is N.
So, I say that the diameter of each sphere will be
d = (6*f / (math.pi*N))**(1/3)
My box has periodic boundary conditions - which means that there is a recurring image of my box in all directions. If a particle at the edge of the box has a portion going beyond the wall, that portion sticks out on the other side.
My attempt:
Create a numpy N-by-3 array box which holds the position vector of each particle [x,y,z]
The first particle is fine as it is.
The next particle in the array is checked with all the previous particles. If the distance between them is more than d, move on to the next particle. If they overlap, randomly change the position vector of the particle in question. If the new position does not overlap with the previous atoms, accept it.
Repeat steps 2-3 for the next particle.
I am trying to populate my box with these hard spheres, in the following manner:
import math
import numpy as np

N = 500                                      # number of spheres (example value)
f = 0.45                                     # packing fraction
L = 1.0                                      # side length of the unit box
diameter = (6 * f / (math.pi * N)) ** (1 / 3)
box = np.random.uniform(0, 1, (N, 3))        # position vector [x, y, z] of each particle

for i in range(1, N):
    mybool = True
    print("particles in box: " + str(i))
    while mybool:  # if we place a bad particle, change its position and restart the checking
        for j in range(0, i):
            displacement = box[j, :] - box[i, :]
            for k in range(3):
                # minimum-image convention for the periodic boundaries
                if abs(displacement[k]) > L / 2:
                    displacement[k] -= L * np.sign(displacement[k])
            # check distance between the ith particle and the trailing j particles
            distance = np.linalg.norm(displacement, 2)
            if distance < diameter:
                # change the position of the ith particle randomly and restart the process
                box[i, :] = np.random.uniform(0, 1, (1, 3))
                break
            if j == i - 1 and distance > diameter:
                mybool = False
                break
The problem with this code is that for f = 0.45 it takes a really, really long time to converge. Is there a better, more efficient method to solve this problem?
I think what you are looking for is either the hexagonal close-packed (HCP) lattice or the cubic close-packed (CCP, also known as face-centered cubic, FCC) one. See e.g. Wikipedia on Close-packing of equal spheres.
Since your space has periodic conditions, I believe it doesn't matter which one you choose (HCP or CCP); they both achieve the same density of ~74.04%, which Gauss proved to be the highest density achievable by lattice packing.
Update:
For the follow-up question on how to generate efficiently one such lattice, let's take as an example the HCP lattice. First, let's create a bunch of (i, j, k) indices [(0,0,0), (1,0,0), (2,0,0), ..., (0,1,0), ...]. Then, get xyz coordinates from those indices and return a DataFrame with them:
import numpy as np
import pandas as pd

def hcp(n):
    dim = 3
    k, j, i = [v.flatten()
               for v in np.meshgrid(*([range(n)] * dim), indexing='ij')]
    df = pd.DataFrame({
        'x': 2 * i + (j + k) % 2,
        'y': np.sqrt(3) * (j + 1/3 * (k % 2)),
        'z': 2 * np.sqrt(6) / 3 * k,
    })
    return df
We can plot the result as scatter3d using plotly for interactive exploration:
import plotly.graph_objects as go

df = hcp(12)

fig = go.Figure(data=go.Scatter3d(
    x=df.x, y=df.y, z=df.z, mode='markers',
    marker=dict(size=df.x*0 + 30, symbol="circle", color=-df.z, opacity=1),
))
fig.show()
Note: plotly's scatter3d is not a very good rendering of spheres: the marker sizes are constant (so when you zoom in and out, the "spheres" will appear to change relative size), and there is no shading, limited z-ordering faithfulness, etc., but it's convenient to interact with the plot.
Resize and clip to the unit box:
Here, a strict clipping (each sphere needs to be completely inside the unit box). Your "periodic boundary condition" is something you will need to address separately (see further below for ideas).
def hcp_unitbox(r):
    n = int(np.ceil(1 / (np.sqrt(3) * r)))
    df = hcp(n) * r
    df += r
    df = df[(df <= 1 - r).all(axis=1)]
    return df
With this, you find that a radius of 0.06 gives you 608 fully enclosed spheres:
hcp_unitbox(.06).shape # (608, 3)
Where you would go next:
You may dig deeper into the effect of your so-called "periodic boundary conditions", and perhaps play with some rotations (and small translations).
To do so, you may try to generate an HCP-lattice that is large enough that any rotation will still fully enclose your unit cube. For example:
r = 0.2 # example
n = int(np.ceil(2 / r))
df = hcp(n) * r - 1
Then rotate it (by any amount) and translate it (by up to one radius in any direction) as you wish for your research, and clip. The "periodic boundary conditions", as you call them, add a bit of extra challenge, as the clipping becomes trickier: first, clip any sphere whose center is outside your box; then select the spheres close enough to the boundaries (or even partition the regions of interest into overlapping regions along the walls of your cube) and check for collisions among the spheres, as per your periodic boundary conditions, within each such region. A rough sketch of the rotate/translate/clip step follows.
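This sketch assumes the hcp function and r from above and uses scipy's Rotation purely as an example; only the simple "center inside the box" clipping is shown, and the periodic collision checks would come after it:
import numpy as np
from scipy.spatial.transform import Rotation

r = 0.2
n = int(np.ceil(2 / r))
centers = hcp(n).to_numpy() * r - 1          # the large lattice from the snippet above

rot = Rotation.random()                      # a random rotation (any rotation would do)
shift = np.random.uniform(-r, r, size=3)     # translation of up to one radius

cube_center = np.array([0.5, 0.5, 0.5])
centers = rot.apply(centers - cube_center) + cube_center + shift

# simple clipping: keep spheres whose center lies inside the unit box
inside = ((centers >= 0) & (centers <= 1)).all(axis=1)
centers = centers[inside]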
I have added a link to the data set here. The first script produces a line graph using signal output data. My next step was to identify the peaks present on the line graph. The second script has an algorithm to identify all the peaks present on the line graph. However, it is too sensitive: it classifies even the slightest bumps on the graph as peaks. I do not want this; I only wish to identify the conspicuous (large) bumps as peaks. How do I modify the second script to do this?
import matplotlib.pyplot as plt
import numpy as np

X = np.zeros((10, 4096))
Y = np.zeros((10, 4096))

n = 0
for line in open('data_set2.txt', 'r'):
    values = [float(s) for s in line.split()]
    X[n, 0] = values[0] - 1566518691968
    for m in range(4096):
        Y[n, m] = values[m + 1]
    n = n + 1

plt.plot(Y[1, 0:4095])
plt.show()
b = (X[1:]-X[:-1])[:-1]
c = (X[:-1]-X[1:])[1:]
minima = np.where(np.bitwise_and(b<0, c<0))[0]+1
maxima = np.where(np.bitwise_and(b>0, c>0))[0]+1
all_peaks = np.where((b*c)>0)[0]+1
del b,c
print(minima)
print(maxima)
print(all_peaks)
I wish you had attached the data set so I could try my solution before posting it here. What I think you're doing is looking for all the points that are higher than the point before and the point after them, which is why you're ending up with too many points. What your code should look for, though, is a number of the highest peaks. It does not matter whether a point is a peak, and the peak's height itself does not matter either; what matters is the "uniqueness" of the peak, or how much higher it is than the average point. Imagine removing the highest three peaks in your example and zooming in: you will find a new set of peaks that look much higher than the rest, and so on.
The tricky thing for you is to find the number of those peaks, which depends on how sensitive you want your code to be.
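A rough sketch of that "keep only the k most outstanding maxima" idea on a 1-D signal y (k is the sensitivity knob mentioned above; purely illustrative):
import numpy as np

def k_highest_peaks(y, k):
    y = np.asarray(y)
    # indices that are strictly higher than both neighbours
    peak_idx = np.where((y[1:-1] > y[:-2]) & (y[1:-1] > y[2:]))[0] + 1
    # of those local maxima, keep only the k highest
    order = np.argsort(y[peak_idx])[::-1]
    return peak_idx[order[:k]]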
There are some packages for identifying peaks. SciPy provides the scipy.signal.find_peaks function (https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.find_peaks.html). There are also peak-identification tools for MATLAB, Octave and others.
All of them require some quantitative criterion for peak identification. None will work with just "conspicuous (large) bumps".
UPD. So, if you want to write the code yourself, you have to choose some filtering function. A few obvious ones:
Peaks whose value is greater than x;
Peaks that are greater than x% of the maximum peak's value;
The x largest peaks (where x < total number of peaks).
By varying the parameter x you can arrive at a solution that satisfies your criterion of "conspicuous (large) bumps" (see the find_peaks example below).
P. S. There might be other filtering functions, but it looks like in your case the width and shape of the peaks do not matter.
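For example, with scipy.signal.find_peaks the "conspicuous bumps only" requirement can be expressed through the height and prominence arguments (the thresholds below are placeholders to tune, and Y is assumed to come from the question's first script):
import numpy as np
from scipy.signal import find_peaks

signal = Y[1, :]                                 # the row plotted above
peaks, props = find_peaks(
    signal,
    height=signal.mean() + 3 * signal.std(),     # only clearly high points (tune this)
    prominence=signal.std(),                     # ...that also stand out from their surroundings
)
print(peaks, props["peak_heights"])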
I am unaware of the algorithms suggested by someone here; if I were you I'd check them out because their way should be more efficient. However, if you want to be lazy, here is a solution:
First I gather all the peaks (the same peaks you get):
# assumes X, Y are allocated as in your first script and n starts at 0
x_range = range(4096)
peaks = []

for line in open('data_set2.txt', 'r'):
    values = [float(s) for s in line.split()]
    X[n, 0] = values[0] - 1566518691968
    for m in range(4096):
        Y[n, m] = values[m + 1]
        # only work with the plotted row, starting from the third value
        if n != 1 or m in (0, 1):
            continue
        # if a point is higher than the point before and the point after
        if Y[n, m - 2] < Y[n, m - 1] > Y[n, m]:
            peaks.append((x_range[m - 1], Y[n, m - 1]))
    n = n + 1

plt.plot(Y[1, 0:4095])
plt.show()
Then I loop through the peaks in segments of 100 (assuming no two peaks occur within a 100-point range, otherwise one of them will be discarded) and find each segment's maximum. If the segment maximum is at least some percentage of the absolute maximum, it is included.
import sys

max_ = max(peaks, key=lambda x: x[1])[1]

# If the graph does not have peaks
if max_ < np.mean(Y) * 10:  # Tweak point 1
    print("No peaks")
    sys.exit()

highest = []
sensitivity = 0.2  # Tweak point 2

for i in range(0, len(peaks), 100):
    try:
        segment = peaks[i: i + 100]
    except IndexError:
        segment = peaks[i:]
    finally:
        segment_max = max(segment, key=lambda x: x[1])[1]
        if segment_max >= max_ * sensitivity:
            highest.append(segment_max)

print(highest)
I have a large quantity of pixel colors (96 thousand different colors):
And I want to get some kind of a mathematically-defined probability region like in this question:
The main obstacle I see right now is that all the methods I find on Google are mainly about visualisation and two-dimensional spaces; there is no algorithm for finding the coefficients of an equation like:
a1x² + b1y² + c1z² + a2xy + b2xz + c2yz + a3x + b3y + c3z = 0
And this paper is too difficult for me to implement it in python. :(
Anyway, what I just want is to determine whether some pixel lies more or less within the range of colors I have.
I tried making it using scikit-learn clustering, but I failed, probably because I have only one
set of data. And creating an array of 256³ elements
representing each pixel color seems like the wrong way.
I wonder if there is an easy way to determine boundaries of this point cluster?
Or maybe I'm just overthinking it and there is something like OpenCV's
cv2.inRange() function?
This can be solved by optimization and fitting of an ellipsoid polynomial. However, I would start with a geometrical approach, which is much faster:
find avg point position
that will be the center of your ellipsoid
p0 = sum (p[i]) / n // average
i = { 0,1,2,3,...,n-1 } // of all points
If your point density is not homogeneous then it is safer to use the bounding box center instead. So find xmin, ymin, zmin, xmax, ymax, zmax, and the middle between them is your center.
find the most distant point from the center
that will give you the main semi-axis
pa = p[j];
|p[j]-p0| >= |p[i]-p0| // max
i = { 0,1,2,3,...,n-1 } // of all points
find the second semi-axis
so the vector pa-p0 is normal to the plane in which the other semi-axes should lie. So find the most distant point from p0 within that plane:
pb = p[j];
|p[j]-p0| >= |p[i]-p0| // max
dot(pa-p0,p[j]-p0) == 0 // but only if inside the plane
i = { 0,1,2,3,...,n-1 } // from all points
Beware that the result of the dot product may not be precisely zero, so it is better to test against something like this:
|dot(pa-p0,p[j]-p0)| <= 1e-3
You can use any threshold you want (should be based on the ellipsoid size).
find the last semi-axis
So we know that the last semi-axis should be perpendicular to both
(pa-p0) AND (pb-p0)
So find a point such that:
pc = p[j];
|p[j]-p0| >= |p[i]-p0| // max
dot(pa-p0,p[j]-p0) == 0 // but only if inside the plane
dot(pb-p0,p[j]-p0) == 0 // and also perpendicular to the b semi-axis
i = { 0,1,2,3,...,n-1 } // from all points
Ellipsoid
Now you have all the parameters you need to form your ellipsoid. The vectors
(pa-p0),(pb-p0),(pc-p0)
are the basis vectors of your ellipsoid (you can make them perpendicular by using the cross product). Their sizes give you the radii, and p0 is the center. You can also use this parametric equation:
a=pa-p0;
b=pb-p0;
c=pc-p0;
p(u,v) = p0 + a*cos(u)*cos(v)
+ b*cos(u)*sin(v)
+ c*sin(u);
u = < -0.5*PI , +0.5*PI >
v = < 0.0 , 2.0*PI >
This whole process is just O(n) and the results can be used as a starting point for both optimization and fitting, to speed them up without loss of accuracy. If you want to further improve accuracy, see:
How approximation search works
The sub-links show you examples of fitting ...
You can also take a look at this:
Algorithms: Ellipse matching
which is basically similar to your task, but only in 2D; still, it may give you some ideas.
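A minimal NumPy sketch of the geometric procedure above (it uses a normalized dot product so the perpendicularity tolerance tol is scale-independent; this is an approximation to start from, not a fit, and tol may need relaxing if no candidate passes):
import numpy as np

def ellipsoid_axes(pts, tol=1e-2):
    pts = np.asarray(pts, dtype=float)           # (n, 3) point cloud
    p0 = pts.mean(axis=0)                        # center: average point position
    d = pts - p0
    dist = np.linalg.norm(d, axis=1)

    a = d[np.argmax(dist)]                       # main semi-axis: most distant point

    # candidates roughly perpendicular to a (cosine of the angle close to zero)
    cos_a = np.abs(d @ a) / (np.linalg.norm(a) * dist + 1e-12)
    perp_a = cos_a <= tol                        # relax tol if this selects nothing
    b = d[perp_a][np.argmax(dist[perp_a])]       # second semi-axis

    # candidates roughly perpendicular to both a and b
    cos_b = np.abs(d @ b) / (np.linalg.norm(b) * dist + 1e-12)
    perp_ab = perp_a & (cos_b <= tol)
    c = d[perp_ab][np.argmax(dist[perp_ab])]     # third semi-axis

    return p0, a, b, c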
Here is a non-strict solution with a fast and simple random-search approach*. Its best side: no heavy linear algebra library required**. It seems to have worked fine for mesh collision detection.
It assumes that the ellipsoid center matches the cloud center, and then uses a sort of mirrored average to search for the main axis.
The full working code is slightly bigger and placed on git; the idea of the main-axis search is here:
np.random.shuffle(pts)
pts_len = len(pts)
pt_average = np.sum(pts, axis=0) / pts_len
vec_major = pt_average * 0
minor_max, major_max = 0, 0

# may be improved with an overlapped pass
for pt_cur in pts:
    vec_cur = pt_cur - pt_average
    # proj_length is defined in the full code linked above
    proj_len, rej_len = proj_length(vec_cur, vec_major)
    if proj_len < 0:
        vec_cur = -vec_cur
    vec_major += (vec_cur - vec_major) / pts_len
    major_max = max(major_max, abs(proj_len))
    minor_max = max(minor_max, rej_len)
It can be improved/optimized even more at some points. Here are examples of what it will produce:
And the full experiment code with plots:
*i.e. adjusting code lines randomly until they work
**which was actually the reason for figuring out this solution
Does anyone know a good method to calculate the empirical/sample covariogram, if possible in Python?
This is a screenshot of a book which contains a good definition of a covariogram:
If I understood it correctly, for a given lag/width h, I'm supposed to get all the pairs of points that are separated by h (or less than h), multiply their values, and for each of these points calculate its mean, which in this case is defined as m(x_i). However, according to the definition of m(x_i), if I want to compute m(x1), I need to obtain the average of the values located within distance h from x1. This looks like a very intensive computation.
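(For reference, and assuming the book uses the standard empirical form, the quantity being estimated per lag h is C(h) = (1/|N(h)|) * sum over pairs (i, j) in N(h) of z_i*z_j − m_h * m'_h, where N(h) is the set of point pairs separated by approximately h and m_h, m'_h are the means of the first and second members of those pairs; the answer further down computes exactly this per distance bin. The m(x_i) variant described above replaces those pair means with per-point local means, which is what makes it look so expensive.)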
First of all, am I understanding this correctly? If so, what is a good way to compute this, assuming a two-dimensional space? I tried to code this in Python (using numpy and pandas), but it takes a couple of seconds and I'm not even sure it is correct, which is why I will refrain from posting that code here. Here is another attempt, a very naive implementation:
import numpy as np
from scipy.spatial.distance import pdist, squareform

distances = squareform(pdist(np.array(coordinates)))  # coordinates is a nx2 array
z = np.array(z)                                       # z are the values
cutoff = np.max(distances) / 3.0                      # somewhat arbitrary cutoff
width = cutoff / 15.0
widths = np.arange(0, cutoff + width, width)

Z = []
Cov = []

for w in np.arange(len(widths) - 1):  # for each width
    # for each pairwise distance
    for i in np.arange(distances.shape[0]):
        for j in np.arange(distances.shape[1]):
            if distances[i, j] <= widths[w + 1] and distances[i, j] > widths[w]:
                m1 = []
                m2 = []
                # when a distance is within a given width, calculate the means of
                # the points involved
                for x in np.arange(distances.shape[1]):
                    if distances[i, x] <= widths[w + 1] and distances[i, x] > widths[w]:
                        m1.append(z[x])
                for y in np.arange(distances.shape[1]):
                    if distances[j, y] <= widths[w + 1] and distances[j, y] > widths[w]:
                        m2.append(z[y])
                mean_m1 = np.array(m1).mean()
                mean_m2 = np.array(m2).mean()
                Z.append(z[i] * z[j] - mean_m1 * mean_m2)
    Z_mean = np.array(Z).mean()  # calculate covariogram for width w
    Cov.append(Z_mean)           # collect covariances for all widths
However, now I have confirmed that there is an error in my code. I know that because I used the variogram to calculate the covariogram (covariogram(h) = covariogram(0) - variogram(h)) and I get a different plot:
And it is supposed to look like this:
Finally, if you know a Python/R/MATLAB library to calculate empirical covariograms, let me know. At least, that way I can verify what I did.
One could use scipy.cov, but if one does the calculation directly (which is very easy), there are more ways to speed this up.
First, make some fake data that has some spatial correlations. I'll do this by first making a spatial correlation map, and then generating random data points that are positioned according to the underlying map and also take on the values of the underlying map.
Edit 1:
I changed the data point generator so positions are purely random, but z-values are proportional to the spatial map. And I changed the map so that the left and right sides are shifted relative to each other to create negative correlation at large h.
from numpy import *
import random
import math
import matplotlib.pyplot as plt

S = 1000
N = 900

# first, make some fake data, with correlations on two spatial scales
# density map
x = linspace(0, 2*pi, S)
sx = sin(3*x)*sin(10*x)
density = .8 * abs(outer(sx, sx))
density[:, :S//2] += .2

# make a point cloud motivated by this density
random.seed(10)  # so this can be repeated
points = []
while len(points) < N:
    v, ix, iy = random.random(), random.randint(0, S-1), random.randint(0, S-1)
    if True:  # v < density[ix, iy]:
        points.append([ix, iy, density[ix, iy]])
locations = array(points).transpose()
print(locations.shape)

plt.imshow(density, alpha=.3, origin='lower')
plt.plot(locations[1, :], locations[0, :], '.k')
plt.xlim((0, S))
plt.ylim((0, S))
plt.show()

# build these into the main data: all pairs into distances and z0 z1 values
L = locations
m = array([[math.sqrt((L[0,i]-L[0,j])**2 + (L[1,i]-L[1,j])**2), L[2,i], L[2,j]]
           for i in range(N) for j in range(N) if i > j])
Which gives:
The above is just the simulated data, and I made no attempt to optimize its production, etc. I assume this is where the OP starts, with the task below, since the data already exists in a real situation.
Now calculate the "covariogram" (which is much easier than generating the fake data, btw). The idea here is to sort all the pairs and associated values by h, and then index into these using ihvals. That is, summing up to index ihval is the sum over N(h) in the equation, since this includes all pairs with hs below the desired values.
Edit 2:
As suggested in the comments below, N(h) is now only the pairs that are between h-dh and h, rather than all pairs between 0 and h (where dh is the spacing of h-values in ihvals -- ie, S/1000 was used below).
# now do the real calculations for the covariogram
# sort by h and give clear names
i = argsort(m[:,0])  # h sorting
h = m[i, 0]
zh = m[i, 1]
zsh = m[i, 2]
zz = zh * zsh

hvals = linspace(0, S, 1000)  # the values of h to use (S should be in units of distance; here I just used ints)
ihvals = searchsorted(h, hvals)

result = []
for k, ihval in enumerate(ihvals[1:]):
    start, stop = ihvals[k], ihval   # pairs with h between hvals[k] and hvals[k+1]
    N = stop - start
    if N > 0:
        mnh = sum(zh[start:stop]) / N
        mph = sum(zsh[start:stop]) / N
        szz = sum(zz[start:stop]) / N
        C = szz - mnh * mph
        result.append([h[ihval], C])
result = array(result)

plt.plot(result[:,0], result[:,1])
plt.grid()
plt.show()
which looks reasonable to me, as one can see bumps or troughs at the expected h values, but I haven't done a careful check.
The main speedup here over scipy.cov is that one can precalculate all of the products, zz. Otherwise, one would feed zh and zsh into cov for every new h, and all the products would be recalculated. This calculation could be sped up even more by doing partial sums, i.e., from ihvals[n-1] to ihvals[n] at each step n, but I doubt that will be necessary.
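If that partial-sums speedup were ever needed, here is a hedged sketch using np.add.reduceat (it reuses h, zh, zsh, zz, hvals and ihvals from the block above, and assumes every entry of ihvals is smaller than len(h), which is true whenever some pair is farther apart than the largest hval):
import numpy as np

counts = np.diff(ihvals)
valid = counts > 0                                    # reduceat returns a stray single element for empty bins
sum_zh  = np.add.reduceat(zh,  ihvals)[:-1][valid]
sum_zsh = np.add.reduceat(zsh, ihvals)[:-1][valid]
sum_zz  = np.add.reduceat(zz,  ihvals)[:-1][valid]
n = counts[valid]
C = sum_zz / n - (sum_zh / n) * (sum_zsh / n)
result = np.column_stack([hvals[1:][valid], C])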