There's some context to this, so bear with me please.
I have a list of lists, call it nested_lists, where each list is of the form [[1,2,3,...], [4,3,1,...]] (i.e. each list contains two lists of integers). Now, in each of these lists, the two lists of integers have the same length and two integers corresponding to the same index represent a coordinate in R^2.
So for example, (1,4) would be one coordinate from the above example.
Now, my task is to draw 5 unique coordinates uniformly (i.e. each coordinate has the same probability of being chosen) and without replacement from all of the coordinates in the lists in nested_lists.
One very straightforward way to do this would be to:
1. Create a list of ALL the unique coordinates in nested_lists.
2. Use numpy.random.choice to sample 5 elements uniformly without replacement.
The code would be something like this:
import numpy as np
coordinates = []
# Get list of all unique coordinates
for pair in nested_lists:
    l = len(pair[0])
    for i in range(0, l):
        coordinate = (pair[0][i], pair[1][i])
        if coordinate not in coordinates:
            coordinates += [coordinate]
# np.random.choice needs a 1-D input, so sample indices rather than the tuples
idx = np.random.choice(len(coordinates), 5, replace=False)
draws = [coordinates[k] for k in idx]
But getting a set of all the unique coordinates can be very computationally expensive, especially if nested_lists contains millions of lists, each with thousands of coordinates in them. So I'm looking for methods to perform the same draws without having to get a list of all the coordinates first.
One method I thought of would be to sample with weighted probabilities from each list in nested_lists.
So get a list of the sizes (number of coordinates) of each list, and then go through each list and draw each of its coordinates with probability (size/sum(sizes))*(1/sum(sizes)). Repeating the process until 5 unique coordinates are drawn should then correspond to what we wanted to draw. The code would be something like this:
no_coordinates = lambda x: len(x[0])
sizes = list(map(no_coordinates, nested_lists))
i = 0
sum_sizes = sum(sizes)
draws = []
while i != 5:  # to make sure we get 5 draws
    for pair in nested_lists:
        size = len(pair[0])
        p = size/(sum_sizes**2)
        for j in range(0, size):
            if i >= 5:  # exit for loop when we reach 5 draws
                break
            if np.random.random() < p and (pair[0][j], pair[1][j]) not in draws:
                draws.append((pair[0][j], pair[1][j]))
                i += 1
The code above seems to be more computationally efficient, but I am not sure if it actually draws with the probability that is required overall. From my calculation, the overall probability would be sum(sizes)/sum_sizes**2, which is the same as 1/sum_sizes (our required probability), but again, I'm not sure if this is correct.
So I was wondering if there are more efficient approaches to drawing like I want, and if my approach is actually correct or not.
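For comparison, here is a minimal sketch of a two-stage draw that never builds the full coordinate list. It assumes that "uniform" may be taken over coordinate occurrences, so a coordinate that appears in several lists is proportionally more likely to be drawn; if exact uniformity over distinct coordinates is required, some deduplication is still needed. The function name draw_unique_coordinates and its parameters are hypothetical.

```python
import bisect
import itertools
import random

def draw_unique_coordinates(nested_lists, k=5, rng=random):
    """Draw k distinct coordinates; every occurrence is equally likely per draw."""
    # Running totals of list sizes let us map a flat index to (list, offset).
    sizes = [len(pair[0]) for pair in nested_lists]
    cumulative = list(itertools.accumulate(sizes))
    total = cumulative[-1]
    draws = set()
    while len(draws) < k:
        flat = rng.randrange(total)                      # uniform over all occurrences
        which = bisect.bisect_right(cumulative, flat)    # list containing that occurrence
        offset = flat - (cumulative[which - 1] if which else 0)
        pair = nested_lists[which]
        draws.add((pair[0][offset], pair[1][offset]))    # duplicates are simply rejected
    return list(draws)

nested = [[[1, 2, 3], [4, 3, 1]], [[7, 8], [9, 10]]]
print(draw_unique_coordinates(nested, 5))
```

Only the list sizes are scanned up front, so the cost is one pass over the outer list plus a handful of lookups. Note that the loop never terminates if there are fewer than k distinct coordinates in total.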
You can use bootstrapping. Basically, the idea is to draw some large (but fixed) amount of coordinates with replacement to estimate probability of each coordinate. Then, you can subsample from this list using transformed densities.
import random
from collections import Counter
bootstrap_sample_size = 1000
total_lists = len(nested_lists)
# number of coordinates per list (assumes every list has the same length)
list_len = len(nested_lists[0][0])
# set will make more sense in this example
# I used counter to allow for future statistical manipulations
c = Counter()
for _ in range(bootstrap_sample_size):
    x, y = random.randrange(total_lists), random.randrange(list_len)
    random_point = nested_lists[x][0][y], nested_lists[x][1][y]
    c.update((random_point,))
# now c contains counts for 1000 points with replacements
# let's just ignore these probabilities to get uniform sample
result = random.sample(list(c), 5)
This will not be exactly uniform, but bootstrap provides statistical guarantees that it will be arbitrarily close to the uniform distribution as bootstrap_sample_size is increased. 1000 samples is usually enough for most real-life applications.
Related
I have a set of sphere coordinates in 3D that evolves.
They represent a stack of spheres which are continuously removed from a box from the bottom of the geometry, and reinserted at the top at a random location. Since this kind of simulation is really periodic, I would like to simulate the drainage of the box a few times (say, 5 times, so t=1 takes positions 1 -> t=5 takes positions 5), and then come back to the first state to simulate the next steps (t=6 takes position 1, t=10 takes positions 5, same for t=11->15, etc.)
The problem is that the coordinates of a given sphere (say, sphere 1) can be very different from the first state to the last simulated one. However, it is very important, for the sake of the simulation, to have a simulation as smooth as possible. If I had to quantify it, I would say that I need the distance between state 5 and state 6 for each pebble to be as low as possible.
It seems to me like an assignment problem. Is there any known solution or method for this kind of problem?
Here is an example of what I would like to have (I mostly use Python):
import numpy as np
# Mockup of the simulation positions
Nspheres = 100
Nsteps = 5 # number of simulated steps
coordinates = np.random.uniform(0,100, (Nsteps, Nspheres, 3)) # mockup x,y,z for each step
initial_positions = coordinates[0]
final_positions = coordinates[Nsteps-1]
indices_adjust_initial_positions = adjust_initial_positions(initial_positions, final_positions)  # to do
adjusted_initial_positions = initial_positions[indices_adjust_initial_positions]
# Quantification of error made
mean_error = np.mean(np.abs(final_positions-adjusted_initial_positions))
max_error = np.max(np.abs(final_positions-adjusted_initial_positions))
print(mean_error, max_error)
# Assign it for each "cycle"
Ncycles = 5 # Number of times the simulation is repeated
simulation_coordinates = np.empty((Nsteps*Ncycles, Nspheres, 3))
simulation_coordinates[:Nsteps] = np.array(coordinates)
for n in range(1, Ncycles):
    new_cycle_coordinates = simulation_coordinates[Nsteps*(n-1):Nsteps*(n):, indices_adjust_initial_positions, :]
    simulation_coordinates[Nsteps*n:Nsteps*(n+1)] = new_cycle_coordinates
# Print result
print(simulation_coordinates)
The adjust_initial_positions function would therefore take the initial and final states, and determine the ideal set of indices to apply to the initial state so that it looks as much like the final state as possible. Please note, if that makes the problem any simpler, that I do not really care if the very top spheres do not match well between the two states; however, it is important to be as close as possible towards the bottom.
Would you have any suggestion?
After some research, it seems that scipy.optimize has some nice features able to do something like it. If list1 is my first step, list2 is my last simulated step, we can do something like:
cost = np.linalg.norm(list2[:, np.newaxis, :] - list1, axis=2)
_, indexes = scipy.optimize.linear_sum_assignment(cost)
list3 = list1[indexes]
Therefore, list3 will be as close to list2 as possible thanks to the index sorting, while taking the positions of list1.
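To make the index handling concrete, here is a small self-contained sketch of the same idea. It assumes SciPy is installed; the helper name match_to_reference and the mock data are hypothetical. The mock list1 is just a shuffled copy of list2, so a correct assignment should recover list2 exactly.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_to_reference(list1, list2):
    """Return indices idx such that list1[idx][i] is the point paired with list2[i]."""
    # cost[i, j] = Euclidean distance between list2[i] and list1[j]
    cost = np.linalg.norm(list2[:, np.newaxis, :] - list1[np.newaxis, :, :], axis=2)
    # Minimises the total distance over all one-to-one pairings.
    _, col_ind = linear_sum_assignment(cost)
    return col_ind

rng = np.random.default_rng(0)
list2 = rng.uniform(0, 100, (50, 3))   # reference state (e.g. last simulated step)
perm = rng.permutation(50)
list1 = list2[perm]                    # same points with shuffled labels
idx = match_to_reference(list1, list2)
list3 = list1[idx]
print(np.abs(list3 - list2).max())
```

The cost matrix is N x N, so memory grows quadratically with the number of spheres. The assignment minimises the summed distance, not the maximum one, so a few individual spheres can still end up far from their partner.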
A square box of size 10,000*10,000 has 1,000,000 particles distributed uniformly. The box is divided into grid cells, each of size 100*100, so there are 10,000 cells in total. At every time-step (for a total of 2016 steps), I would like to identify the cell to which each particle belongs. Is there an efficient way to implement this in Python? My implementation is below and currently takes approximately 83s for one run.
import numpy as np
import time
start=time.time()
# Size of the layout
Layout = np.array([0,10000])
# Total Number of particles
Population = 1000000
# Array to hold the cell number
cell_number = np.zeros((Population),dtype=np.int32)
# Limits of each cell
boundaries = np.arange(0,10100,step=100)
cell_boundaries = np.dstack((boundaries[0:100],boundaries[1:101]))
# Position of Particles
points = np.random.uniform(0,Layout[1],size = (Population,2))
# Generating a list with the x,y boundaries of each cell in the grid
x = []
limit_list = cell_boundaries
for i in range(0,Layout[1]//100):
    for j in range(0,Layout[1]//100):
        x.append([limit_list[0][i,0],limit_list[0][i,1],limit_list[0][j,0],limit_list[0][j,1]])
# Identifying the cell to which the particles belong
i=0
for y in (x):
    cell_number[(points[:,1]>y[0])&(points[:,1]<y[1])&(points[:,0]>y[2])&(points[:,0]<y[3])]=i
    i+=1
print(time.time()-start)
I am not sure about your code. You seem to be incrementing the i variable globally, whereas it should be accumulated on a per-cell basis, correct? Something like cell_number[???] += 1, maybe?
Anyhow, the way I see is from a different perspective. You could start by assigning each point a cell id. Then inverse the resulting array with a kind of counter function. I have implemented the following in PyTorch, you will most likely find equivalent utilities in Numpy.
The conversion from 2-point coordinates to cell ids corresponds to dividing the coordinates by the cell size (100), applying floor, then unfolding them according to your grid's width (100 cells).
>>> p = (torch.from_numpy(points) / 100).floor()
>>> p_unfold = p[:, 0]*100 + p[:, 1]
Then you can "invert" the statistics, i.e. find out how many particles there are in each respective cell based on the cell ids. This can be done using PyTorch's histogram counter torch.histc, with one bin per cell:
>>> torch.histc(p_unfold, bins=10000, min=0, max=10000)
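A NumPy-only sketch of the same idea follows. It assumes square cells of side 100 on a 100 x 100 grid and reproduces the question's numbering, where the cell id is the cell index along points[:, 1] times 100 plus the cell index along points[:, 0]. The function name assign_cells is hypothetical.

```python
import numpy as np

def assign_cells(points, cell_size=100.0, cells_per_side=100):
    """Map (N, 2) positions to flat cell ids without looping over cells."""
    # Integer cell index along each axis.
    ij = np.floor(points / cell_size).astype(np.int64)
    # Guard against points lying exactly on the upper boundary.
    np.clip(ij, 0, cells_per_side - 1, out=ij)
    # Same ordering as the nested i/j loops in the question.
    return ij[:, 1] * cells_per_side + ij[:, 0]

rng = np.random.default_rng(0)
points = rng.uniform(0, 10000, size=(1000, 2))
cell_number = assign_cells(points)
# Number of particles per cell, if the occupancy is also needed.
counts = np.bincount(cell_number, minlength=100 * 100)
```

This does a constant amount of work per particle instead of one full pass over all particles per cell, which is where the 10,000-iteration loop spends its time.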
Consider a 3D numpy array D of dimension, say, (30 x 40 x 50). For each voxel D[x,y,z] I want to store a vector that contains neighboring voxels within a certain radius (including the D[x,y,z] itself).
(As an example here is a picture of such a sphere of radius 2: https://puu.sh/wwIYW/e3bd63ceae.png)
Is there a simple and fast way to code this?
I have written a function for it, but it is painfully slow and IDLE eventually crashes because the data structure I store the vectors in becomes too large.
Current code:
import itertools
import numpy as np

def searchlight(M_in):
    radius = 4
    [m,n,k] = M_in.shape
    M_out = np.zeros([m,n,k],dtype=object)
    count = 0
    for i in range(m):
        for j in range(n):
            for z in range(k):
                i_interval = list(range((i-4),(i+5)))
                j_interval = list(range((j-4),(j+5)))
                z_interval = list(range((z-4),(z+5)))
                coordinates = list(itertools.product(i_interval,j_interval,z_interval))
                coordinates = [pair for pair in coordinates if ((abs(pair[0]-i)+abs(pair[1]-j)+abs(pair[2]-z))<=radius)]
                coordinates = [pair for pair in coordinates if ((pair[0]>=0) and (pair[1]>=0) and pair[2]>=0) and (pair[0]<m) and (pair[1]<n) and (pair[2]<k)]
                out = []
                for pair in coordinates:
                    out.append(M_in[pair[0],pair[1],pair[2]])
                M_out[i,j,z] = out
                count = count +1
    return M_out
Here is a way to do that. For efficiency, you need to use ndarrays. This only takes complete voxels into account; edges must be managed "by hand".
from pylab import *
a=rand(100,100,100) # the data
r=4
ra=range(-r,r+1)
sphere=array([[x,y,z] for x in ra for y in ra for z in ra if np.abs((x,y,z)).sum()<=r])
# the unit "sphere"
indcenters=array(meshgrid(*(range(r,n-r) for n in a.shape),indexing='ij'))
# indexes of the centers of the voxels. edges are cut.
all_inds=(indcenters[newaxis].T+sphere.T).T
#all the indexes.
voxels=np.stack([a[tuple(inds)] for inds in all_inds],-1)
# the voxels.
#voxels.shape is (92, 92, 92, 129)
All the costly operations are vectorized. List comprehensions are preferred for clarity in the outer loop.
You can now perform vectorized operations on voxels, for example finding the brightest voxel:
light=voxels.sum(-1)
print(np.unravel_index(light.argmax(),light.shape))
#(33,72,64)
All of this is of course expensive in memory. You must split your space for big data or voxels.
Since you say the data structure is too large, you'll likely have to compute the vector on the fly for a given voxel. You can do this pretty quickly though:
import itertools
import numpy as np

class SearchLight(object):
    def __init__(self, M_in, radius):
        self.M_in = M_in
        m, n, k = self.M_in.shape
        # compute the sphere coordinates centered at (0,0,0)
        # just like in your sample code
        i_interval = list(range(-radius,radius+1))
        j_interval = list(range(-radius,radius+1))
        z_interval = list(range(-radius,radius+1))
        coordinates = list(itertools.product(i_interval,j_interval,z_interval))
        coordinates = [pair for pair in coordinates if ((abs(pair[0])+abs(pair[1])+abs(pair[2]))<=radius)]
        # store those indices as a template
        self.sphere_indices = np.array(coordinates)

    def get_vector(self, i, j, k):
        # offset sphere coordinates by the requested centre.
        coordinates = self.sphere_indices + [i,j,k]
        # filter out of bounds coordinates
        coordinates = coordinates[(coordinates >= 0).all(1)]
        coordinates = coordinates[(coordinates < self.M_in.shape).all(1)]
        # use those coordinates to index the initial array.
        return self.M_in[coordinates[:,0], coordinates[:,1], coordinates[:,2]]
To use the object on a given array you can simply do:
sl = SearchLight(M_in, 4)
# get vector of values for voxel i,j,k
vector = sl.get_vector(i,j,k)
This should give you the same vector you would get from
M_out[i,j,k]
in your sample code, without storing all the results at once in memory.
This can also probably be further optimized, particularly in terms of the coordinate filtering, but it may not be necessary. Hope that helps.
I'm simulating a 2-dimensional random walk, with direction 0 < θ < 2π and T=1000 steps. I already have a code which simulates a single walk, repeats it 12 times, and saves each run into sequentially named text files:
import math
import numpy as np
import numpy.random as rd
a=np.zeros((1000,2), dtype=float)
print(a) # Prints array with zeros as entries
# Single random walk
def randwalk(x,y): # Defines the randwalk function
    theta=2*math.pi*rd.rand()
    x+=math.cos(theta)
    y+=math.sin(theta)
    return (x,y) # Function returns new (x,y) coordinates
x, y = 0., 0. # Starting point is the origin
for i in range(1000): # Walk contains 1000 steps
    x, y = randwalk(x,y)
    a[i,:] = x, y # Replaces entries of a with (x,y) coordinates
# Repeating random walk 12 times
fn_base = "random_walk_%i.txt" # Saves each run to sequentially named .txt
for j in range(12):
    rd.seed() # Uses different random seed for every run
    x, y = 0., 0.
    for i in range(1000):
        x, y = randwalk(x,y)
        a[i,:] = x, y
    fn = fn_base % j # Allocates fn to the numbered file
    np.savetxt(fn, a) # Saves run data to appropriate text file
Now I want to calculate the mean square displacement over all 12 walks. To do this, my initial thought was to import the data from each text file back into a numpy array, eg:
infile="random_walk_0.txt"
rw0dat=np.genfromtxt(infile)
print(rw0dat)
And then somehow manipulate the arrays to find the mean square displacement.
Is there a more efficient way to go about finding the MSD with what I have?
Here is a quick snippet to compute the mean square displacement (MSD) as a function of time lag,
where path is made of points equally spaced in time, as seems to be the case
for your randwalk. You can just place it in the 12-walk for loop and compute it for each walk stored in a.
#input path =[ [x1,y1], ... ,[xn,yn] ].
def compute_MSD(path):
    totalsize=len(path)
    msd=[]
    for i in range(totalsize-1):
        j=i+1
        msd.append(np.sum((path[0:-j]-path[j::])**2)/float(totalsize-j))
    msd=np.array(msd)
    return msd
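As a sanity check of this lag-based definition, a straight-line path with unit steps should give MSD(j) = j^2, since every pair of points j steps apart is exactly a distance j apart. The sketch below re-implements the same averaging under the hypothetical name msd_by_lag so that it runs on its own.

```python
import numpy as np

def msd_by_lag(path):
    """Mean square displacement for every time lag 1 .. len(path)-1."""
    path = np.asarray(path, dtype=float)
    n = len(path)
    out = np.empty(n - 1)
    for lag in range(1, n):
        # Displacements between all pairs of points separated by `lag` steps.
        diff = path[lag:] - path[:-lag]
        out[lag - 1] = np.mean(np.sum(diff ** 2, axis=1))
    return out

# Five points on the x axis: (0,0), (1,0), (2,0), (3,0), (4,0)
line = np.column_stack([np.arange(5.0), np.zeros(5)])
result = msd_by_lag(line)
print(result)
```

For a random walk the same function should instead grow roughly linearly with the lag, which is a quick way to tell the two regimes apart.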
First, you don't actually need to store the whole 1000-step walk, just the final position.
Also, there's no reason to store them out to text files and load them back; you can just use them in memory: put them in a list of arrays, or in an array of one more dimension. Even if you need to write them out, you can do that in addition to keeping the final values, rather than instead of keeping them. (Also, if you're not actually using numpy for performance or simplicity in building the 2D array, you might want to consider building it iteratively, e.g., using the csv module, but that one's more of a judgment call.)
At any rate, given your 12 final positions, you just calculate the distance of each one from (0, 0), then square that, sum them all, and divide by 12. (Or, since the obvious way to compute the distance from (0, 0) is to just add the squares of the x and y positions and then squareroot the result, just skip the squareroot and square at the end.)
But if you want to store each whole walk into a file for some reason, then after you load them back in, walk[-1] gives you the final position as a 1D array of 2 values. So, you can either read those 12 final positions into a 12x2 array and vectorize the mean square distance, or just accumulate them in a list and do it manually.
While we're at it, the rd.seed() isn't necessary; the whole point of a PRNG is that you continue to get different numbers unless you explicitly reset the seed to its original value to repeat them.
Here's an example of dropping the two extra complexities and doing everything directly:
destinations = np.zeros((12, 2), dtype=float)
for j in range(12):
    x, y = 0., 0.
    for i in range(1000):
        x, y = randwalk(x, y)
    destinations[j] = x, y
square_distances = destinations[:,0] ** 2 + destinations[:,1] ** 2
mean_square_distance = np.mean(square_distances)
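If the displacement at every step is of interest and not only the final one, the whole ensemble can also be generated in one go. This is only a sketch: it swaps the per-step randwalk calls for a vectorized draw of all angles, and the function name simulate_walks and its parameters are hypothetical.

```python
import numpy as np

def simulate_walks(n_walks=12, n_steps=1000, seed=None):
    """Return positions with shape (n_walks, n_steps, 2) for unit-step walks."""
    rng = np.random.default_rng(seed)
    # One random direction per walk and per step.
    theta = rng.uniform(0.0, 2.0 * np.pi, size=(n_walks, n_steps))
    steps = np.stack([np.cos(theta), np.sin(theta)], axis=-1)
    # Cumulative sum of unit steps gives the position after each step.
    return np.cumsum(steps, axis=1)

walks = simulate_walks(seed=0)
sq_disp = np.sum(walks ** 2, axis=-1)      # squared distance from the origin
msd_per_step = sq_disp.mean(axis=0)        # average over the 12 walks
mean_square_distance = msd_per_step[-1]    # value after the final step
print(mean_square_distance)
```

With only 12 walks the estimate is noisy; for unit steps the expected value after T steps is T, so the result should scatter around 1000.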
So this is a little follow up question to my earlier question: Generate coordinates inside Polygon and my answer https://stackoverflow.com/a/15243767/1740928
In fact, I want to bin polygon data to a regular grid. Therefore, I calculate a couple of coordinates within the polygon and translate their lat/lon combination to their respective column/row combo of the grid.
Currently, the row/column information is stored in a numpy array with its number of rows corresponding to the number of data polygons and its number of columns corresponding to the coordinates in the polygon.
The rest of the code takes less than a second, but this part is the bottleneck at the moment (~7 sec):
for ii in np.arange(len(data)):
    for cc in np.arange(data_lats.shape[1]):
        final_grid[ row[ii,cc], col[ii,cc] ] += data[ii]
        final_grid_counts[ row[ii,cc], col[ii,cc] ] += 1
The array "data" simply contains the data values for each polygon (80000,). The arrays "row" and "col" contain the row and column number of a coordinate in the polygon (shape: (80000,16)).
As you can see, I am summing up all data values within each grid cell and count the number of matches. Thus, I know the average for each grid cell in case different polygons intersect it.
Still, how can these two for loops take around 7 seconds? Can you think of a faster way?
I think numpy should add an nd-bincount function, I had one lying around from a project I was working on some time ago.
import numpy as np
def two_d_bincount(row, col, weights=None, shape=None):
    if shape is None:
        shape = (row.max() + 1, col.max() + 1)
    row = np.asarray(row, 'int')
    col = np.asarray(col, 'int')
    x = np.ravel_multi_index([row, col], shape)
    out = np.bincount(x, weights, minlength=np.prod(shape))
    return out.reshape(shape)
weights = np.column_stack([data] * row.shape[1])
final_grid = two_d_bincount(row.ravel(), col.ravel(), weights.ravel())
final_grid_counts = two_d_bincount(row.ravel(), col.ravel())
I hope this helps.
I might not fully understand the shapes of your different grids, but you can maybe eliminate the cc loop using something like this:
final_grid = np.zeros((nrows,ncols))
for ii in range(len(data)):
    np.add.at(final_grid, (row[ii,:], col[ii,:]), data[ii])
This of course assumes that final_grid starts with no other info (that the count you're incrementing starts at zero). I'm also not sure how to test whether it works, since I don't fully understand how your row and col arrays work.
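Both loops can also be dropped with np.add.at, which accumulates correctly even when the same cell index appears more than once (plain fancy-index assignment does not). The sketch below assumes data has shape (n_polygons,) and row and col have shape (n_polygons, n_points), as described in the question. The function name bin_polygons and the tiny mock arrays are hypothetical.

```python
import numpy as np

def bin_polygons(data, row, col, shape):
    """Sum data values and count hits per grid cell without Python loops."""
    total = np.zeros(shape, dtype=float)
    counts = np.zeros(shape, dtype=np.int64)
    # Repeat each polygon's value once per sampled point (a view, no copy).
    weights = np.broadcast_to(np.asarray(data, dtype=float)[:, np.newaxis], row.shape)
    np.add.at(total, (row, col), weights)   # unbuffered add: repeated cells accumulate
    np.add.at(counts, (row, col), 1)
    return total, counts

data = np.array([1.0, 2.0])
row = np.array([[0, 0], [0, 1]])
col = np.array([[0, 1], [0, 1]])
final_grid, final_grid_counts = bin_polygons(data, row, col, shape=(2, 2))
print(final_grid)
print(final_grid_counts)
```

The per-cell average is then final_grid divided by final_grid_counts wherever the count is non-zero. For very large inputs a bincount-based approach is often faster than np.add.at, so it is worth timing both.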