I am trying to make a map in 2D coordinates with the color defined by a third variable. I already defined the grid with the following commands:
import numpy as np

b_step = np.linspace(-75,90,12)
l_step = np.linspace(0,360,25)
grid = [(x,y) for x in b_step for y in l_step]
My data set has three variables: b and l are the coordinates, and the real data is called s. There are about 7 million data points. I first want to distribute the data over those grid cells and take the average of s within each cell; then I will use the averaged s to make the map. Does anyone have ideas on how to bin the data into the grid cells efficiently and take the averages?
I know ROOT's TH2F (a powerful tool in the High Energy Physics community) can handle this, but I want to write it in a more pythonic way. Thanks.
ROOT's TH2F is the best way to handle this efficiently. If you create two TH2F histograms, one to accumulate the data and the other to count the number of entries, you can calculate the mean value in each grid cell. The Python code for this is below:
import numpy as np
from ROOT import TH2F, gStyle, TCanvas

##### If you want equally spaced grid points:
#h1 = TH2F('h1','h1',l_num,0.0,360.0,b_num,-90.0,90.0)
#h2 = TH2F('h2','h2',l_num,0.0,360.0,b_num,-90.0,90.0)

##### If you want non-equally spaced grid points:
xBins = 37
yBins = 17
xEdges = np.linspace(-185,185,38)
yEdges = np.array([-105.0,-75.0,-60.0,-45.0,-30.0,-15.0,15.0,35.0,40.0,45.0,50.0,55.0,60.0,65.0,70.0,75.0,80.0,100.0])

h1 = TH2F('h1','h1',xBins,xEdges,yBins,yEdges)
h2 = TH2F('h2','h2',xBins,xEdges,yBins,yEdges)

# Fill the histograms: h1 accumulates the signal, h2 counts entries per bin.
for i in range(data_size):
    h1.Fill(x[i], y[i], signal[i])
    h2.Fill(x[i], y[i], 1)

# Loop over the bins and compute the mean signal in each one.
for ii in range(1, h1.GetNbinsX()+1):
    for jj in range(1, h1.GetNbinsY()+1):
        ss = h1.GetBinContent(ii, jj)
        nn = h2.GetBinContent(ii, jj)
        xx = h1.GetXaxis().GetBinCenter(ii)
        yy = h1.GetYaxis().GetBinCenter(jj)
        mean = ss/nn if nn > 0 else 0.0   # guard against empty bins
Now you have the grid coordinates xx and yy and the mean value in each cell, and you can make the color plot.
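If you would rather stay inside numpy/scipy (the "more pythonic" route asked for), a minimal sketch along these lines should also work. It assumes 1-D arrays l, b and s as described in the question and reuses the question's bin edges; scipy.stats.binned_statistic_2d does the binning and the averaging in one call:
import numpy as np
from scipy.stats import binned_statistic_2d

# Assumed inputs: 1-D arrays l, b (coordinates) and s (signal), all the same length.
l_edges = np.linspace(0, 360, 25)
b_edges = np.linspace(-75, 90, 12)

# Mean of s in each (l, b) cell; cells with no data come back as NaN.
mean_s, l_edge, b_edge, binnumber = binned_statistic_2d(
    l, b, s, statistic='mean', bins=[l_edges, b_edges])

# mean_s has shape (len(l_edges)-1, len(b_edges)-1); transpose it for plotting,
# e.g. plt.pcolormesh(l_edge, b_edge, mean_s.T) with l on the x-axis.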
I'm trying to see if I can do GIS raster arithmetic in Python using numpy ndarrays. I am calculating probability values (z) for multiple defined areas (x, y) and want to subsequently add up the probabilities. The grids overlap but don't have the same dimensions in x and y. The result should add the probabilities where the grids overlap and keep the probability values of the respective grids where they don't.
I have worked out the grids but cannot add them together. Can this be done in numpy, or do I need rasterio/GDAL tools? Are mgrids the best approach?
I've created some simple mgrids to illustrate the issue.
import numpy as np
import matplotlib.pyplot as plt

ys, xs = np.mgrid[5:15:5j, 0:5:5j]
f = lambda x, y: x * y + 1
vf = np.vectorize(f)
r = vf(xs, ys)
c1 = np.array([xs, ys, r])
ys2, xs2 = np.mgrid[4:9:6j, 1:6:6j]
f2 = lambda x, y: x + y * 2
vf2 = np.vectorize(f2)
r2 = vf2(xs2, ys2)
c2 = np.array([xs2, ys2, r2])
To plot them:
plt.contourf(c1[0], c1[1], c1[2], levels = 100)
plt.colorbar()
plt.show()
c1 plot
plt.contourf(c2[0], c2[1], c2[2], levels = 100)
plt.colorbar()
plt.show()
c2 plot
Your two grids aren't in a strict superset-subset relationship. This is what they look like:
As you can see, while the domains of the two grids overlap, the actual grid points are disjoint save a single grid point.
How would you accumulate these data points? The arrays you have store a probability z_i = z(x_i, y_i) for each grid point. The function is discrete and only defined on a grid. You cannot add two datasets unless their grid points correspond to one another. It doesn't make any sense to do so.
What you could do is interpolate the data from one of the grids and add that to the other grid (or to a third, common grid). This will not be exact, and it can only work if your data is smooth enough. Furthermore, your grids only partially overlap, so you will have to decide how to handle regions covered by one grid versus both. Note that contourf also accepts data that is 2d plaid (i.e. as if generated from mgrid). So the answer to your original question is "you can't straightforwardly do that".
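For illustration, here is a rough sketch of that interpolation route, assuming the c1/c2 arrays built above. scipy.interpolate.griddata resamples the scattered c2 points onto c1's grid; points of c1 outside c2's extent come back as NaN and are treated as "no contribution":
import numpy as np
from scipy.interpolate import griddata

# Interpolate the second dataset onto the first dataset's grid points.
r2_on_c1 = griddata(
    points=np.column_stack([c2[0].ravel(), c2[1].ravel()]),
    values=c2[2].ravel(),
    xi=(c1[0], c1[1]),
    method='linear')          # NaN outside c2's convex hull

# Add only where c2 actually overlaps c1; elsewhere keep c1's values.
combined = c1[2] + np.nan_to_num(r2_on_c1)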
I have a square 2D array data that I would like to add to a larger 2D array frame at a given set of non-integer coordinates coords. The idea is that data will be interpolated onto frame with its center at the new coordinates.
Some toy data:
import numpy as np

# A gaussian to add to the frame
x, y = np.meshgrid(np.linspace(-1,1,10), np.linspace(-1,1,10))
data = 50*np.exp(-np.sqrt(x**2+y**2)**2)
# The frame to add the gaussian to
frame = np.random.normal(size=(100,50))
# The desired (x,y) location of the gaussian center on the new frame
coords = 23.4, 22.6
Here's the idea. I want to add this:
to this:
to get this:
If the coordinates were integers (indexes), of course I could simply add them like this:
frame[23:33,22:32] += data
But I want to be able to specify non-integer coordinates so that data is regridded and added to frame.
I've looked into PIL.Image methods, but my use case is just 2D data, not images. Is there a way to do this with just scipy? Can it be done with interp2d or a similar function? Any guidance would be greatly appreciated!
Scipy's shift function (scipy.ndimage.shift, also available in the older scipy.ndimage.interpolation namespace) is what you are looking for, as long as the grid spacings of data and frame match. If not, look at the other answer. shift accepts floating-point offsets and does a spline interpolation. First I put the data into an array as large as frame, then shift it, and then add it. Make sure to reverse the coordinate list, as x is the rightmost dimension in numpy arrays. One of the nice features of shift is that values shifted in from outside the array are set to zero.
import numpy as np
import matplotlib.pyplot as plt
from scipy.ndimage import shift
# A gaussian to add to the frame.
x, y = np.meshgrid(np.linspace(-1,1,10), np.linspace(-1,1,10))
data = 50*np.exp(-np.sqrt(x**2+y**2)**2)
# The frame to add the gaussian to
frame = np.random.normal(size=(100,50))
x_frame = np.arange(50)
y_frame = np.arange(100)
# The desired (x,y) location of the gaussian center on the new frame.
coords = np.array([23.4, 22.6])
# First, embed data in an array as large as frame.
data_large = np.zeros(frame.shape)
data_large[:data.shape[0], :data.shape[1]] = data[:,:]
# Subtract half the distance as the bottom left is at 0,0 instead of the center.
# The shift of 4.5 is because data is 10 points wide.
# Reverse the coords array as x is the last coordinate.
coords_shift = -4.5
data_large = shift(data_large, coords[::-1] + coords_shift)
frame += data_large
# Plot the result and add lines to indicate the coordinates
plt.figure()
plt.pcolormesh(x_frame, y_frame, frame, cmap=plt.cm.jet)
plt.axhline(coords[1], color='w')
plt.axvline(coords[0], color='w')
plt.colorbar()
plt.gca().invert_yaxis()
plt.show()
The script gives you the following figure, with the desired coordinates indicated by white lines.
One possible solution is to use scipy.interpolate.RectBivariateSpline. In the code below, x_0 and y_0 are the coordinates of a feature in data (i.e., the position of the center of the Gaussian in your example) that needs to be mapped to the coordinates given by coords. There are a couple of advantages to this approach:
If you need to "place" the same object into multiple locations in the output frame, the spline needs to be computed only once (but evaluated multiple times).
In case you actually need to compute integrated flux of the model over a pixel, you can use the integral method of scipy.interpolate.RectBivariateSpline.
Resample using spline interpolation:
import numpy as np
from scipy.interpolate import RectBivariateSpline

x = np.arange(data.shape[1], dtype=float)
y = np.arange(data.shape[0], dtype=float)
kx = 3; ky = 3  # spline degree
spline = RectBivariateSpline(
    x, y, data.T, kx=kx, ky=ky, s=0
)
# Define coordinates of a feature in the data array.
# This can be the center of the Gaussian:
x_0 = (data.shape[1] - 1.0) / 2.0
y_0 = (data.shape[0] - 1.0) / 2.0
# create output grid, shifted as necessary:
yg, xg = np.indices(frame.shape, dtype=np.float64)
xg += x_0 - coords[0] # see below how to account for pixel scale change
yg += y_0 - coords[1] # see below how to account for pixel scale change
# resample and fill extrapolated points with 0:
resampled_data = spline.ev(xg, yg)
extrapol = (((xg < -0.5) | (xg >= data.shape[1] - 0.5)) |
((yg < -0.5) | (yg >= data.shape[0] - 0.5)))
resampled_data[extrapol] = 0
Now plot the frame and resampled data:
plt.figure(figsize=(14, 14));
plt.imshow(frame+resampled_data, cmap=plt.cm.jet,
origin='upper', interpolation='none', aspect='equal')
plt.show()
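As a small aside on the integrated-flux advantage mentioned earlier, a hedged sketch using the spline and the xg/yg maps built above might look like this (the pixel indices are arbitrary, and the half-pixel edges assume the unit-zoom case):
# Hypothetical example: integrated flux of the model over output pixel (row j, col i).
# xg/yg hold the data-frame coordinates of each output pixel center, so the pixel
# edges in data coordinates are the center +/- 0.5 when there is no zoom.
j, i = 25, 20
flux = spline.integral(xg[j, i] - 0.5, xg[j, i] + 0.5,
                       yg[j, i] - 0.5, yg[j, i] + 0.5)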
If you also want to allow for scale changes, then replace code for computing xg and yg above with:
coords = 20, 80 # change coords to easily identifiable (in plot) values
zoom_x = 2 # example scale change along X axis
zoom_y = 3 # example scale change along Y axis
yg, xg = np.indices(frame.shape, dtype=np.float64)
xg = (xg - coords[0]) / zoom_x + x_0
yg = (yg - coords[1]) / zoom_y + y_0
Most likely this is what you actually want, based on your example. Specifically, the coordinates of pixels in data are spaced 0.222(2) distance units apart, so it seems that for your particular example (whether accidental or intentional) you have a zoom factor of 0.222(2). In that case your data image would shrink to almost 2 pixels in the output frame.
Comparison to #Chiel's answer
In the image below, I compare the results from my method (left panel), #Chiel's method (center panel) and their difference (right panel):
Fundamentally, the two methods are quite similar and possibly even use the same algorithm (I did not look at the code for shift, but based on the description it also uses splines). From the comparison image it is visible that the biggest differences are at the edges and that, for reasons unknown to me, shift seems to truncate the shifted image slightly too soon.
I think the biggest difference is that my method allows for pixel scale changes and lets you re-use the same interpolator to place the original image at different locations in the output frame. #Chiel's method is somewhat simpler, but what I did not like about it is that it requires creating a larger array (data_large) with the original image placed in its corner.
While the other answers have gone into detail, here's my lazy solution:
import numpy as np

xc, yc = 23.4, 22.6
x, y = np.meshgrid(np.linspace(-1,1,10)-xc%1, np.linspace(-1,1,10)-yc%1)
data = 50*np.exp(-np.sqrt(x**2+y**2)**2)
frame = np.random.normal(size=(100,50))
frame[23:33,22:32] += data
And it's just the way you wanted it. As you mentioned, the coordinates of both arrays are the same, so the origin of data sits somewhere between the indices. Now simply shift it by the fractional part of the target coordinates (the remainder modulo one) in the second line and you're good to go (you might need to flip the sign, but I think this is correct).
Suppose I have a matrix where the first column contains the x coordinates, the second column the y coordinates, and the third and fourth columns are indicator variables telling whether a point belongs to a particular 'cluster' (each entry is 1 or 0; so a 1 in column 3 of the third row means that the point in the third row belongs to, say, cluster 1, which is represented by column 3).
My question is: how do I create a figure, scatter plot all the points belonging to cluster 1, and then, on the same plot, scatter the remaining points in another color? In MATLAB I would just call figure, then hold on, and write out my commands. I am new to plotting in Python and not sure how this would be done.
EDIT:
I think I made it work. How would I, however, change the marker size depending on which cluster a point belongs to?
Let's start with how we'd do this in MATLAB.
Supposing you have N unique clusters, you can simply loop over the clusters and plot each one's points in a different colour; we can also change the marker size at each iteration. You'll need logical indexing to extract the points that belong to each cluster. Given that your matrix is stored in M, something like this comes to mind:
rng(123); %// Set random seeds

%// Total number of clusters
N = max(M(:,3));

%// Create a colour map
cmap = rand(N,3);

%// Store point sizes per cluster
sizes = [10 14 18];

figure; hold on; %// Create a blank figure and hold for changes
for ii = 1 : N
    %// Determine those points belonging to the ith cluster
    ind = M(:,3) == ii;

    %// Get the x and y coordinates
    x = M(ind,1);
    y = M(ind,2);

    %// Plot the points in a different colour
    plot(x,y,'.','Color', cmap(ii,:), 'MarkerSize', sizes(ii));
end

%// Create labels
labels = sprintfc('Label %d', 1:N);

%// Make our legend
legend(labels{:});
The code is pretty self-explanatory: you define your matrix M, and we determine the total number of clusters by taking the max of the third column. Next we create a random colour map with as many rows as there are clusters and three columns giving a unique RGB colour per cluster; each row defines the colour used when plotting that cluster.
Next we create an array sizes storing the marker size for each cluster. We create a blank figure, hold it so our changes accumulate, and then iterate over the clusters. For each cluster, we find the right rows of M through logical indexing, extract the x and y coordinates of those points, and plot them on the figure, manually specifying the colour as an RGB tuple as well as the desired marker size.
We then create a cell array of labels denoting which cluster each set of points belongs to, and show a legend mapping points to clusters using those labels.
Generating random data with random labels, where we have 20 points uniformly distributed in [0,1] for both x and y, each with a random label between 1 and 3:
rng(123);
M = [rand(20,2) randi(3,20,1)];
I get this plot when I run the above code:
Getting the equivalent in Python is pretty easy: it's just a transcription from MATLAB to Python, and the plotting mechanisms are exactly the same. You're using matplotlib, so I'm assuming numpy is available as well, since it's a dependency.
As such, the equivalent code would look something like this:
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(123)

# Total number of clusters
N = int(np.max(M[:,2]))

# Create a colour map
cmap = np.random.rand(N, 3)

# Store point sizes per cluster
sizes = np.array([10, 14, 18])

plt.figure()  # Create blank figure. No need to hold on
for ii in range(N):
    # Determine those points belonging to the ith cluster
    ind = M[:,2] == (ii+1)

    # Get the x and y coordinates
    x = M[ind,0]
    y = M[ind,1]

    # Plot the points in a different colour
    # Also add in labels for legend
    plt.plot(x, y, '.', color=tuple(cmap[ii]), markersize=sizes[ii],
             label='Cluster #' + str(ii+1))

# Make our legend
plt.legend()

# Show the image
plt.show()
I won't bother explaining this one because it's pretty much the same as the MATLAB code. There are some nuances, such as the way hold on works in matplotlib: you don't need it, because any changes you make to the figure are remembered until you decide to show it. There is also the nuance that numpy and Python start indexing at 0 instead of 1.
Using the same data-generation code as in MATLAB:
M = np.column_stack([np.random.rand(20,2), np.random.randint(1,4,size=(20,1))])
I get this figure:
I have a problem where I use a computer program called MCNP to calculate the energy deposition in a square geometry from a particle flux. The geometry is broken down into a mesh grid with 50 cubic meshes along the length, width and height. The data is written to a text file listing the centroid position of each mesh in Cartesian coordinates (x, y and z) and the energy deposition at that (x, y, z) coordinate, and is then extracted with a Python script. I have a script that takes a slice in the z plane and plots a heat map of the energy deposition on that plane. It works, but I don't think it is very efficient, and I am looking for ways to vectorize the process.
The code reads in the X, Y and Z coordinates as three separate 1-D numpy arrays, and the energy deposition at each coordinate as a fourth 1-D numpy array. For the sake of this description, let's assume I want to take a slice at z = 0, but none of the mesh centroids lie at z = 0. The code then has to cycle through the Z-coordinate array until it finds a value greater than zero (array index i) whose preceding value (index i-1) is less than zero. It then uses those two z-values, the slice location (here 0) and the energy depositions at those indices to interpolate the energy deposition at the slice location. Since the X and Y arrays are unaffected, I then have the X, Y coordinates and can plot a heat map of the energy deposition at the slice location. The code also needs to determine whether the slice location is already in the data set, in which case no interpolation is needed. The code I have works, but I could not see how to use the built-in scipy interpolation schemes, so I wrote my own interpolation function and used a for loop to iterate until I found the positions where the z-coordinate brackets the slice location (z = 0 in this instance). I am attaching my example code and asking for help to better vectorize this snippet (if it can be vectorized), and hopefully to learn something in the process.
# - This transforms the read-in data from lists to numpy arrays,
#   where Magnitude represents the energy deposition
XArray = np.array(XArray); YArray = np.array(YArray)
ZArray = np.array(ZArray); Magnitude = np.array(Magnitude)

#==============================================================
# - This section creates planar data for a 2-D plot

# Interpolation function for determining a 2-D slice of the 3-D data
def Interpolate(X1, X2, Y1, Y2, X3):
    Slope = (Y2-Y1)/(X2-X1)
    Y3 = (X3-X1)*Slope
    Y3 = Y3 + Y1
    return Y3

# This represents the location on the Z-axis where a slice is taken
Slice_Location = 0.0

XVal = []; YVal = []; ZVal = []
Tally = []; Error = []

counter = 1          # integer index (a float here would break array indexing)
length = len(XArray) - 1
for numbers in range(length):
    # - If data falls on the selected plane location then use existing data
    if ZArray[counter] == Slice_Location:
        XVal.append(XArray[counter])
        YVal.append(YArray[counter])
        ZVal.append(ZArray[counter])
        Tally.append(float(Magnitude[counter]))
    # - If existing data does not exist on selected plane then interpolate
    if ZArray[counter-1] < Slice_Location and ZArray[counter] > Slice_Location:
        XVal.append(XArray[counter])
        YVal.append(YArray[counter])
        ZVal.append(Slice_Location)
        Value = Interpolate(ZArray[counter-1], ZArray[counter],
                            Magnitude[counter-1], Magnitude[counter],
                            Slice_Location)
        Tally.append(float(Value))
    counter = counter + 1

XVal = np.array(XVal); YVal = np.array(YVal); ZVal = np.array(ZVal)
Tally = np.array(Tally)
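Since the question is specifically about vectorizing this loop, here is one possible sketch (an illustration, not a drop-in replacement for the author's code) that reproduces the same two cases with boolean masks. It assumes the arrays above and that Magnitude is numeric (cast it with Magnitude.astype(float) first if it was read in as strings):
import numpy as np

# Points lying exactly on the slice plane.
on_plane = ZArray == Slice_Location

# Crossings: preceding z below the plane and current z above it (same test as the loop).
idx = np.nonzero((ZArray[:-1] < Slice_Location) & (ZArray[1:] > Slice_Location))[0] + 1

# Linear interpolation of Magnitude between indices idx-1 and idx at the slice location.
frac = (Slice_Location - ZArray[idx - 1]) / (ZArray[idx] - ZArray[idx - 1])
interp_tally = Magnitude[idx - 1] + frac * (Magnitude[idx] - Magnitude[idx - 1])

XVal = np.concatenate([XArray[on_plane], XArray[idx]])
YVal = np.concatenate([YArray[on_plane], YArray[idx]])
Tally = np.concatenate([Magnitude[on_plane], interp_tally])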
Does anyone know a good method to calculate the empirical/sample covariogram, if possible in Python?
This is a screenshot of a book which contains a good definition of a covariogram:
If I understood it correctly, for a given lag/width h, I'm supposed to take all pairs of points separated by h (or by less than h), multiply their values, and for each of these points calculate its mean, which in this case is defined as m(x_i). However, according to the definition of m(x_i), if I want to compute m(x_1) I need to average the values located within distance h of x_1. This looks like a very intensive computation.
First of all, am I understanding this correctly? If so, what is a good way to compute this in a two-dimensional space? I tried to code it in Python (using numpy and pandas), but it takes a couple of seconds and I'm not even sure it is correct, which is why I will refrain from posting that code here. Here is another attempt at a very naive implementation:
import numpy as np
from scipy.spatial.distance import pdist, squareform

distances = squareform(pdist(np.array(coordinates)))  # coordinates is an n x 2 array
z = np.array(z)                                       # z are the values
cutoff = np.max(distances)/3.0                        # somewhat arbitrary cutoff
width = cutoff/15.0
widths = np.arange(0, cutoff + width, width)
Z = []
Cov = []

for w in np.arange(len(widths)-1):                    # for each width
    # for each pairwise distance
    for i in np.arange(distances.shape[0]):
        for j in np.arange(distances.shape[1]):
            if distances[i, j] <= widths[w+1] and distances[i, j] > widths[w]:
                m1 = []
                m2 = []
                # when a distance is within a given width, calculate the means of
                # the points involved
                for x in np.arange(distances.shape[1]):
                    if distances[i, x] <= widths[w+1] and distances[i, x] > widths[w]:
                        m1.append(z[x])
                for y in np.arange(distances.shape[1]):
                    if distances[j, y] <= widths[w+1] and distances[j, y] > widths[w]:
                        m2.append(z[y])
                mean_m1 = np.array(m1).mean()
                mean_m2 = np.array(m2).mean()
                Z.append(z[i]*z[j] - mean_m1*mean_m2)
    Z_mean = np.array(Z).mean()                       # calculate covariogram for width w
    Cov.append(Z_mean)                                # collect covariances for all widths
However, I have now confirmed that there is an error in my code. I know this because when I use the variogram to calculate the covariogram (covariogram(h) = covariogram(0) - variogram(h)) I get a different plot:
And it is supposed to look like this:
Finally, if you know of a Python/R/MATLAB library to calculate empirical covariograms, let me know. At least that way I can verify what I did.
One could use scipy.cov, but if one does the calculation directly (which is very easy), there are more ways to speed this up.
First, make some fake data that has some spatial correlations. I'll do this by first making the spatial correlations and then generating random data points from them, where the data is positioned according to the underlying map and also takes on the values of the underlying map.
Edit 1:
I changed the data point generator so that positions are purely random, but the z-values are proportional to the spatial map. I also changed the map so that the left and right halves are shifted relative to each other, to create negative correlation at large h.
from numpy import *
import math
import random
import matplotlib.pyplot as plt

S = 1000
N = 900

# first, make some fake data, with correlations on two spatial scales
# density map
x = linspace(0, 2*pi, S)
sx = sin(3*x)*sin(10*x)
density = .8 * abs(outer(sx, sx))
density[:, :S//2] += .2

# make a point cloud motivated by this density
random.seed(10)  # so this can be repeated
points = []
while len(points) < N:
    v, ix, iy = random.random(), random.randint(0, S-1), random.randint(0, S-1)
    if True:  # v < density[ix, iy]:
        points.append([ix, iy, density[ix, iy]])

locations = array(points).transpose()
print(locations.shape)
plt.imshow(density, alpha=.3, origin='lower')
plt.plot(locations[1,:], locations[0,:], '.k')
plt.xlim((0, S))
plt.ylim((0, S))
plt.show()

# build these into the main data: all pairs into distances and z0 z1 values
L = locations
m = array([[math.sqrt((L[0,i]-L[0,j])**2 + (L[1,i]-L[1,j])**2), L[2,i], L[2,j]]
           for i in range(N) for j in range(N) if i > j])
Which gives:
The above is just the simulated data, and I made no attempt to optimize its production, etc. I assume this is where the OP starts, with the task below, since in a real situation the data already exists.
Now calculate the "covariogram" (which is much easier than generating the fake data, by the way). The idea here is to sort all the pairs and their associated values by h, and then index into these using ihvals. That is, summing up to index ihval gives the sum over N(h) in the equation, since this includes all pairs with h below the desired value.
Edit 2:
As suggested in the comments below, N(h) now counts only the pairs with separations between h-dh and h, rather than all pairs between 0 and h (where dh is the spacing of the h-values in ihvals; i.e., S/1000 was used below).
# now do the real calculations for the covariogram
# sort by h and give clear names
i = argsort(m[:,0])   # h sorting
h = m[i,0]
zh = m[i,1]
zsh = m[i,2]
zz = zh*zsh

hvals = linspace(0, S, 1000)   # the values of h to use (S should be in units of distance; here I just used ints)
ihvals = searchsorted(h, hvals)

result = []
for i, ihval in enumerate(ihvals[1:]):
    start, stop = ihvals[i], ihval   # pairs with h between hvals[i] and hvals[i+1]
    N = stop - start
    if N > 0:
        mnh = sum(zh[start:stop])/N
        mph = sum(zsh[start:stop])/N
        szz = sum(zz[start:stop])/N
        C = szz - mnh*mph
        result.append([h[ihval], C])

result = array(result)
plt.plot(result[:,0], result[:,1])
plt.grid()
plt.show()
which looks reasonable to me, as one can see bumps or troughs at the expected h values, but I haven't done a careful check.
The main speedup here over scipy.cov is that one can precalculate all of the products, zz. Otherwise, one would feed zh and zsh into cov for every new h, and all the products would be recalculated. This calculation could be sped up even more by doing partial sums, i.e., from ihvals[n-1] to ihvals[n] at each step n, but I doubt that will be necessary.
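For that partial-sums idea, a minimal sketch (same variable names as the code block above, plus cumulative-sum helpers introduced here) could look like this; each bin's sums then come from two subtractions instead of re-summing a slice:
import numpy as np

# Cumulative sums with a leading zero, so that sum(a[start:stop]) == csum[stop] - csum[start].
csum_zh  = np.concatenate(([0.0], np.cumsum(zh)))
csum_zsh = np.concatenate(([0.0], np.cumsum(zsh)))
csum_zz  = np.concatenate(([0.0], np.cumsum(zz)))

starts, stops = ihvals[:-1], ihvals[1:]
counts = stops - starts
ok = counts > 0                                   # skip empty h-bins

mnh = (csum_zh[stops]  - csum_zh[starts])[ok]  / counts[ok]
mph = (csum_zsh[stops] - csum_zsh[starts])[ok] / counts[ok]
szz = (csum_zz[stops]  - csum_zz[starts])[ok]  / counts[ok]
C = szz - mnh * mph                               # one covariance value per non-empty bin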