A square box of size 10,000*10,000 has 1,000,000 particles distributed uniformly. The box is divided into grids, each of size 100*100, so there are 10,000 grids in total. At every time-step (for a total of 2016 steps), I would like to identify the grid to which each particle belongs. Is there an efficient way to implement this in Python? My implementation is below and currently takes approximately 83 s for one run.
import numpy as np
import time
start=time.time()
# Size of the layout
Layout = np.array([0,10000])
# Total Number of particles
Population = 1000000
# Array to hold the cell number
cell_number = np.zeros((Population),dtype=np.int32)
# Limits of each cell
boundaries = np.arange(0,10100,step=100)
cell_boundaries = np.dstack((boundaries[0:100],boundaries[1:101]))
# Position of Particles
points = np.random.uniform(0,Layout[1],size = (Population,2))
# Generating a list with the x,y boundaries of each cell in the grid
x = []
limit_list = cell_boundaries
for i in range(0, Layout[1]//100):
    for j in range(0, Layout[1]//100):
        x.append([limit_list[0][i,0], limit_list[0][i,1], limit_list[0][j,0], limit_list[0][j,1]])
# Identifying the cell to which the particles belong
i=0
for y in x:
    cell_number[(points[:,1]>y[0]) & (points[:,1]<y[1]) & (points[:,0]>y[2]) & (points[:,0]<y[3])] = i
    i += 1
print(time.time()-start)
I am not sure about your code. You seem to be incrementing the i variable globally, while it should be accumulated on a per-cell basis, correct? Something like cell_number[???] += 1, maybe?
Anyhow, I see it from a different perspective. You could start by assigning each point a cell id, then invert the resulting array with a kind of counting function. I have implemented the following in PyTorch; you will most likely find equivalent utilities in NumPy.
The conversion from 2D point coordinates to cell ids corresponds to dividing the coordinates by the cell size (100), applying floor, then unfolding them according to your grid's width of 100 cells:
>>> import torch
>>> p = (torch.from_numpy(points) / 100).floor()
>>> p_unfold = p[:, 0]*100 + p[:, 1]
Then you can "inverse" the statistics, i.e. find out how many particles there are in each respective cell based on the cell ids. This can be done with PyTorch's histogram counter torch.histc, using one bin per cell:
>>> torch.histc(p_unfold, bins=10000, min=0, max=10000)
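For reference, a minimal NumPy-only sketch of the same idea: floor-divide by the cell size to get per-axis cell indices, flatten them into a single id, then count with np.bincount (note the flattened numbering may not match the cell ordering produced by the loop in the question):
>>> ix = (points[:, 0] // 100).astype(np.int64)
>>> iy = (points[:, 1] // 100).astype(np.int64)
>>> cell_number = ix * 100 + iy  # per-particle cell id in [0, 10000)
>>> counts = np.bincount(cell_number, minlength=10000)  # particles per cell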
I have a set of sphere coordinates in 3D that evolves.
They represent a stack of spheres which are continuously removed from the box at the bottom of the geometry and reinserted at the top at a random location. Since this kind of simulation is essentially periodic, I would like to simulate the drainage of the box a few times (say, 5 times, so t=1 takes positions 1 -> t=5 takes positions 5), and then come back to the first state to simulate the next steps (t=6 takes positions 1, t=10 takes positions 5, same for t=11->15, etc.).
The problem is that the coordinates of a given sphere (say, sphere 1) can be very different between the first state and the last simulated one. However, it is very important, for the sake of the simulation, to have the motion as smooth as possible. If I had to quantify it, I would say that I need the distance between state 5 and state 6 for each pebble to be as low as possible.
It seems to me like an assignment problem. Is there any known solution or method for this kind of problem?
Here is an example of what I would like to have (I mostly use Python):
import numpy as np
# Mockup of the simulation positions
Nspheres = 100
Nsteps = 5 # number of simulated steps
coordinates = np.random.uniform(0,100, (Nsteps, Nspheres, 3)) # mockup x,y,z for each step
initial_positions = coordinates[0]
final_positions = coordinates[Nsteps-1]
indices_adjust_initial_positions = adjust_initial_positions(initial_positions, final_positions)  # to do
adjusted_initial_positions = initial_positions[indices_adjust_initial_positions]
# Quantification of error made
mean_error = np.mean(np.abs(final_positions-adjusted_initial_positions))
max_error = np.max(np.abs(final_positions-adjusted_initial_positions))
print(mean_error, max_error)
# Assign it for each "cycle"
Ncycles = 5 # Number of times the simulation is repeated
simulation_coordinates = np.empty((Nsteps*Ncycles, Nspheres, 3))
simulation_coordinates[:Nsteps] = np.array(coordinates)
for n in range(1, Ncycles):
    new_cycle_coordinates = simulation_coordinates[Nsteps*(n-1):Nsteps*n, indices_adjust_initial_positions, :]
    simulation_coordinates[Nsteps*n:Nsteps*(n+1)] = new_cycle_coordinates
# Print result
print(simulation_coordinates)
The adjust_initial_positions function would therefore take the initial and final states and determine the ideal set of indices to apply to the initial state so that it looks as much like the final state as possible. Please note that, if it makes the problem any simpler, I do not really care whether the very top spheres match between the two states; it is, however, important to be as close as possible towards the bottom.
Would you have any suggestion?
After some research, it seems that scipy.optimize has some nice features able to do something like this. If list1 is my first step and list2 is my last simulated step, we can do something like:
import numpy as np
import scipy.optimize

cost = np.linalg.norm(list2[:, np.newaxis, :] - list1, axis=2)
_, indexes = scipy.optimize.linear_sum_assignment(cost)
list3 = list1[indexes]
Therefore, list3 will be as close to list2 as possible thanks to the index sorting, while taking the positions of list1.
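For completeness, here is how this might slot into the mockup from the question (a sketch; the height weighting near the end is only an illustration of one possible way to favour the lower spheres, not part of the assignment method itself):
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
Nspheres = 100
coordinates = rng.uniform(0, 100, (5, Nspheres, 3))  # mockup positions
initial_positions = coordinates[0]
final_positions = coordinates[-1]

# cost[i, j] = distance between final sphere i and initial sphere j
cost = np.linalg.norm(final_positions[:, np.newaxis, :] - initial_positions, axis=2)

# optional: weigh mismatches near the bottom (low z) more heavily
weights = 1.0 / (1.0 + final_positions[:, 2])
cost = cost * weights[:, np.newaxis]

_, indices = linear_sum_assignment(cost)
adjusted_initial_positions = initial_positions[indices]
print(np.mean(np.abs(final_positions - adjusted_initial_positions)))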
I have a problem where, in a grid of size x*y, I am provided a single dot and I need to find the nearest neighbour. In practice, I am trying to find the closest dot to the cursor in pygame that crosses a colour distance threshold, calculated as follows:
sqrt(((rgb1[0]-rgb2[0])**2)+((rgb1[1]-rgb2[1])**2)+((rgb1[2]-rgb2[2])**2))
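As an aside, that distance can be computed for every pixel at once with NumPy (a sketch; img is assumed to be an (H, W, 3) array of RGB values and target the cursor's RGB colour):
import numpy as np

def color_distance_mask(img, target, threshold):
    # Euclidean distance in RGB space, evaluated for every pixel at once
    diff = img.astype(np.float64) - np.asarray(target, dtype=np.float64)
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    return dist < threshold  # boolean (H, W) mask of pixels crossing the threshold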
So far I have a function that calculates different resolutions of the grid, reducing it by a factor of two at each step while always keeping the darkest pixel. It looks as follows:
from PIL import Image
from typing import Dict
import numpy as np
# we input a pillow image object and retrieve a dictionary with every grid version of the 3 dimensional array:
def calculate_resolutions(image: Image) -> Dict[int, np.ndarray]:
    resolutions = {}
    # we start with the highest resolution image, the size of which we initially divide by 1, then 2, then 4 etc.:
    divisor = 1
    # reduce the grid by 5 iterations
    resolution_iterations = 5
    for i in range(resolution_iterations):
        pixel_lookup = image.load()  # convert image to PixelValues object, which allows for pixel lookup via [x,y] index
        # calculate the resolution of the new grid, round upwards:
        resolution = (int((image.size[0] - 1) // divisor + 1), int((image.size[1] - 1) // divisor + 1))
        # generate 3d array with new grid resolution, fill in values that are darker than white:
        new_grid = np.full((resolution[0], resolution[1], 3), np.array([255, 255, 255]))
        for x in range(image.size[0]):
            for y in range(image.size[1]):
                if not x % divisor and not y % divisor:
                    darkest_pixel = (255, 255, 255)
                    x_range = divisor if x + divisor < image.size[0] else (0 if image.size[0] - x < 0 else image.size[0] - x)
                    y_range = divisor if y + divisor < image.size[1] else (0 if image.size[1] - y < 0 else image.size[1] - y)
                    for x_ in range(x, x + x_range):
                        for y_ in range(y, y + y_range):
                            if pixel_lookup[x_, y_][0] + pixel_lookup[x_, y_][1] + pixel_lookup[x_, y_][2] < darkest_pixel[0] + darkest_pixel[1] + darkest_pixel[2]:
                                darkest_pixel = pixel_lookup[x_, y_]
                    if darkest_pixel != (255, 255, 255):
                        new_grid[int(x / divisor)][int(y / divisor)] = np.array(darkest_pixel)
        resolutions[i] = new_grid
        divisor = divisor * 2
    return resolutions
This is the most performance efficient solution I was able to come up with. If this function is run on a grid that continually changes, like a video with x fps, it will be very performance intensive. I also considered using a kd-tree algorithm that simply adds and removes any dots that happen to change on the grid, but when it comes to finding individual nearest neighbours on a static grid this solution has the potential to be more resource efficient. I am open to any kinds of suggestions in terms of how this function could be improved in terms of performance.
Now, I am in a position where, for example, I try to find the nearest neighbour of the current cursor position in a 100x100 grid. The resulting reduced grids are 50^2, 25^2, 13^2, and 7^2. In a situation where a part of the grid looks as follows:
And I am on the aggregation step where a part of the grid consists of six large squares, the black one being the current cursor position and the orange dots being dots where the colour distance threshold is crossed. I would not know which diagonally located closest neighbour I would want to pick to search next; in this case, going one aggregation step down shows that the lower left would be the right choice. Depending on how many grid layers I have, this could result in a very large error in the nearest neighbour search. Is there a good way to solve this problem? If there are multiple squares that show they have a relevant location, do I have to search them all in the next step to be sure? And if that is the case, the further away I get, the more I would need math such as the Pythagorean theorem to check whether the two positive squares I find overlap in terms of distance and could potentially contain the closest neighbour, which would start to be performance intensive again if the function is called frequently. Would it still make sense to pursue this solution over a regular k-d tree? For now the grid size is still fairly small (~800x600), but if the grid gets larger the performance may start suffering again. Is there a good, scalable solution that could be applied here?
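For comparison, here is roughly what the k-d tree route could look like with scipy.spatial.cKDTree (a sketch; matching_mask is assumed to be a boolean (H, W) mask of pixels that cross the colour-distance threshold, and cursor_xy an (x, y) pair). The tree only needs rebuilding when the set of matching pixels changes:
import numpy as np
from scipy.spatial import cKDTree

def nearest_matching_pixel(matching_mask, cursor_xy):
    # coordinates of all pixels that cross the colour-distance threshold
    ys, xs = np.nonzero(matching_mask)
    if len(xs) == 0:
        return None
    tree = cKDTree(np.column_stack([xs, ys]))
    _, idx = tree.query(cursor_xy)  # nearest neighbour of the cursor
    return xs[idx], ys[idx]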
There's some context to this, so bear with me please.
I have a list of lists, call it nested_lists, where each list is of the form [[1,2,3,...], [4,3,1,...]] (i.e. each list contains two lists of integers). Now, in each of these lists, the two lists of integers have the same length and two integers corresponding to the same index represent a coordinate in R^2.
So for example, (1,4) would be one coordinate from the above example.
Now, my task is to draw 5 unique coordinates from nested_lists uniformly (i.e. each coordinate has the same probability of being chosen), without replacement. That is, from all of the coordinates from the lists in nested_lists, I am trying to draw 5 unique coordinates uniformly without replacement.
One very straightforward way to do this would be to: 1. Create a list of ALL the unique coordinates in nested_lists. 2. Use numpy.random.choice to sample 5 elements uniformly without replacement.
The code would be something like this:
import numpy as np
coordinates = []
#Get list of all unique coordinates
for lst in nested_lists:
    l = len(lst[0])
    for i in range(0, l):
        coordinate = (lst[0][i], lst[1][i])
        if coordinate not in coordinates:
            coordinates += [coordinate]
# sample indices rather than tuples, since np.random.choice needs a 1-D array
idx = np.random.choice(len(coordinates), 5, replace=False, p=[1/len(coordinates)]*len(coordinates))
draws = [coordinates[k] for k in idx]
But getting a set of all the unique coordinates can be very computationally expensive, especially if nested_lists contains millions of lists, each with thousands of coordinates in them. So I'm looking for methods to perform the same draws without having to get a list of all the coordinates first.
One method I thought of would be to sample with weighted probabilities from each list in nested_lists.
So get a list of the sizes (number of coordinates) of each list, and then go through each list and draw a coordinate with probability (size/sum(sizes))*(1/sum(sizes)). Repeating the process until 5 unique coordinates are drawn should then correspond to what we wanted to draw. The code would be something like this:
no_coordinates = lambda x: len(x[0])
sizes = list(map(no_coordinates, nested_lists))
i = 0
sum_sizes = sum(sizes)
draws = []
while i != 5:  # to make sure we get 5 draws
    for lst in nested_lists:
        size = len(lst[0])
        p = size / (sum_sizes ** 2)
        for j in range(0, size):
            if i >= 5:  # exit the loop when we reach 5 draws
                break
            if np.random.random() < p and (lst[0][j], lst[1][j]) not in draws:
                draws.append((lst[0][j], lst[1][j]))
                i += 1
The code above seems to be more computationally efficient, but I am not sure if it actually draws with the probability that would be required overall. From my calculation, the overall probability would be sum(sizes)/sum_sizes**2, which is the same as 1/sum_sizes (our required probability), but again, I'm not sure if this is correct.
So I was wondering if there are more efficient approaches to drawing like I want, and if my approach is actually correct or not.
You can use bootstrapping. Basically, the idea is to draw some large (but fixed) number of coordinates with replacement to estimate the probability of each coordinate. Then you can subsample from this list using the transformed densities.
import random
from collections import Counter

bootstrap_sample_size = 1000
total_lists = len(nested_lists)
list_len = len(nested_lists[0][0])  # assumes all lists hold the same number of coordinates
# a set would make more sense in this example;
# I used a Counter to allow for future statistical manipulations
c = Counter()
for _ in range(bootstrap_sample_size):
    x, y = random.randrange(total_lists), random.randrange(list_len)
    random_point = nested_lists[x][0][y], nested_lists[x][1][y]
    c.update((random_point,))
# now c contains counts for 1000 points drawn with replacement
# let's just ignore these probabilities to get a uniform sample
result = random.sample(list(c.keys()), 5)
This will not be exactly uniform, but the bootstrap provides statistical guarantees that it will be arbitrarily close to a uniform distribution as bootstrap_sample_size is increased. 1000 samples is usually enough for most real-life applications.
I have a large 4-dimensional dataset of Temperatures [time,pressure,lat,lon].
I need to find all grid points within a region defined by lat/lon indices and calculate an average over the region to leave me with a 2-dimensional array.
I know how to do this if my region is a rectangle (or square) but how can this be done with an irregular polygon?
Below is an image showing the regions I need to average together and the lat/lon grid the data is gridded to in the array
I believe this should solve your problem.
The code below generates all cells in a polygon defined by a list of vertices.
It "scans" the polygon row by row keeping track of the transition columns where you (re)-enter or exit the polygon.
def row(x, transitions):
    """ generator spitting all cells in a row given a list of transition (in/out) columns."""
    i = 1
    in_poly = True
    y = transitions[0]
    while i < len(transitions):
        if in_poly:
            while y < transitions[i]:
                yield (x, y)
                y += 1
            in_poly = False
        else:
            in_poly = True
            y = transitions[i]
        i += 1

def get_same_row_vert(i, vertices):
    """ find all vertex columns in the same row as vertices[i], and return next vertex index as well."""
    vert = []
    x = vertices[i][0]
    while i < len(vertices) and vertices[i][0] == x:
        vert.append(vertices[i][1])
        i += 1
    return vert, i

def update_transitions(old, new):
    """ update old transition columns for a row given new vertices.
    That is: merge both lists and remove duplicate values (2 transitions at the same column cancel each other)"""
    if old == []:
        return new
    if new == []:
        return old
    o0 = old[0]
    n0 = new[0]
    if o0 == n0:
        return update_transitions(old[1:], new[1:])
    if o0 < n0:
        return [o0] + update_transitions(old[1:], new)
    return [n0] + update_transitions(old, new[1:])

def polygon(vertices):
    """ generator spitting all cells in the polygon defined by given vertices."""
    vertices.sort()
    x = vertices[0][0]
    transitions, i = get_same_row_vert(0, vertices)
    while i < len(vertices):
        while x < vertices[i][0]:
            for cell in row(x, transitions):
                yield cell
            x += 1
        vert, i = get_same_row_vert(i, vertices)
        transitions = update_transitions(transitions, vert)

# define a "strange" polygon (hook shaped)
vertices = [(0,0),(0,3),(4,3),(4,0),(3,0),(3,2),(1,2),(1,1),(2,1),(2,0)]
for cell in polygon(vertices):
    print(cell)
    # or do whatever you need to do
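To connect this back to the original question, here is one possible way to use the generator for the regional average (a sketch; temps is assumed to be your [time, pressure, lat, lon] array and the vertices are given as (lat_index, lon_index) pairs):
import numpy as np

# collect the (lat, lon) index pairs of every cell inside the region
cells = list(polygon(vertices))
lat_idx = np.array([c[0] for c in cells])
lon_idx = np.array([c[1] for c in cells])

# fancy indexing picks out the selected cells, giving shape
# [time, pressure, n_cells]; averaging the last axis leaves [time, pressure]
region_mean = temps[:, :, lat_idx, lon_idx].mean(axis=-1)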
The general class of problems is called "Point in Polygon", where the (fairly) standard algorithm is based on drawing a test line through the point under consideration and counting the number of times it crosses polygon boundaries (it's really cool/weird that it works so simply, I think). This is a really good overview which includes implementation information.
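If you do want to go that route, a minimal even-odd (crossing-number) test looks roughly like this (a sketch, with the polygon given as a list of (x, y) vertices):
def point_in_polygon(px, py, vertices):
    """Return True if (px, py) is inside the polygon (even-odd rule)."""
    inside = False
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        # does a horizontal ray from (px, py) cross the edge (x1,y1)-(x2,y2)?
        if (y1 > py) != (y2 > py):
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside
    return inside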
For your problem in particular, since each of your regions is defined by a small number of square cells, I think a more brute-force approach might be better. Perhaps something like:
For each region, form a list of all of the (lat/lon) squares which define it. Depending on how your regions are defined, this may be trivial, or annoying...
For each point you are examining, figure out which square it lives in. Since the squares are so well behaved, you can do this manually using opposite corners of each square, or using a method like numpy.digitize (see the sketch after these steps).
Test whether the square the point lives in, is in one of the regions.
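As an illustration of the digitize step (a sketch; lat_edges and lon_edges are assumed to be the 1-D arrays of grid-cell boundaries):
import numpy as np

def cell_of(lat, lon, lat_edges, lon_edges):
    # np.digitize returns the index of the bin each value falls into;
    # subtract 1 so the first cell is index 0
    i = np.digitize(lat, lat_edges) - 1
    j = np.digitize(lon, lon_edges) - 1
    return i, j

# membership test: is the point's cell one of the cells that make up the region?
# region_cells could be a set of (i, j) pairs built in the first step above
# in_region = cell_of(lat, lon, lat_edges, lon_edges) in region_cells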
If you're still having trouble, please provide some more details about your problem (specifically, how your regions are defined) --- that will make it easier to offer advice.
Suppose I have a matrix where the first column is all x points, the second column is all y points, and the third and fourth are indicator variables telling whether the point belongs to a particular 'cluster' (each can be either 1 or 0; so if column 3 has a 1 in the third row, it means the point in the third row belongs to, say, cluster 1, which is represented by column 3).
My question is: how do I create a figure, scatter plot all the points belonging to cluster 1, and then on the same plot have a scatter of the remaining points in another colour? In Matlab, I would just say figure, then hold on, and write out my commands. I am new to plotting in Python and not sure how this would be performed.
EDIT:
I think I made it work. How would I, however, change the marker size depending on which cluster the point belongs to?
Let's start with how we'd do this in MATLAB.
Supposing you have N unique clusters, you can simply loop through as many clusters as you have and plot the points in a different colour. Also, we can change the marker size at each iteration. You'll need to use logical indexing to extract out the points that belong to each cluster. Given that your matrix is stored in M, something like this comes to mind:
rng(123); %// Set random seeds
%// Total number of clusters
N = max(M(:,3));
%// Create a colour map
cmap = rand(N,3);
%// Store point sizes per cluster
sizes = [10 14 18];
figure; hold on; %// Create a blank figure and hold for changes
for ii = 1 : N
    %// Determine those points belonging to the ith cluster
    ind = M(:,3) == ii;
    %// Get the x and y coordinates
    x = M(ind,1);
    y = M(ind,2);
    %// Plot the points in a different colour
    plot(x, y, '.', 'Color', cmap(ii,:), 'MarkerSize', sizes(ii));
end
%// Create labels
labels = sprintfc('Label %d', 1:N);
%// Make our legend
legend(labels{:});
The code is pretty self-explanatory: you need to define your matrix M, and we determine the total number of clusters by taking the max of the third column. Next we create a random colour map which has as many rows as there are clusters and three columns corresponding to a unique RGB colour per cluster. Each row defines the colour we'll use when plotting that cluster.
Next we create an array of sizes storing the marker size to use for each cluster. We create a blank figure, hold it for the changes we make to the plot, then iterate over each cluster of points. For each cluster, we figure out the right points in M to extract through logical indexing, pull out the x and y coordinates for those points, then plot these points on the figure in a scatter formation, manually specifying the colour as an RGB tuple as well as the desired marker size.
We then create a cell array of labels that denote which set of points each cluster belongs to, then show a legend illustrating which points belong to which clusters given this array of labels.
Generating random data with random labels, where we have 20 points uniformly distributed between [0,1] for both x and y, together with a random set of up to three labels:
rng(123);
M = [rand(20,2) randi(3,20,1)];
I get this plot when I run the above code:
To get the equivalent in Python, well that's pretty easy. It's just a transcription from MATLAB to Python and the plotting mechanisms are exactly the same. You're using matplotlib and so I'm assuming numpy can be used as it's a dependency.
As such, the equivalent code would look something like this:
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(123)
# Total number of clusters
N = int(np.max(M[:,2]))
# Create a colour map
cmap = np.random.rand(N, 3)
# Store point sizes per cluster
sizes = np.array([10, 14, 18])
plt.figure()  # Create blank figure. No need to hold on
for ii in range(N):
    # Determine those points belonging to the ith cluster
    ind = M[:,2] == (ii+1)
    # Get the x and y coordinates
    x = M[ind,0]
    y = M[ind,1]
    # Plot the points in a different colour
    # Also add in labels for legend
    plt.plot(x, y, '.', color=tuple(cmap[ii]), markersize=sizes[ii], label='Cluster #' + str(ii+1))
# Make our legend
plt.legend()
# Show the image
plt.show()
I won't bother explaining this one because it's pretty much the same as what you see in the MATLAB code. There are some nuances, such as the way hold on works in matplotlib: you don't need hold on because any changes you make to the figure will be remembered until you decide to show it. You also have the nuance that numpy and Python start indexing at 0 instead of 1.
Using the same generation data code like in MATLAB:
M = np.column_stack([np.random.rand(20,2), np.random.randint(1,4,size=(20,1))])
I get this figure:
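As a side note, if you prefer a single call instead of a loop, matplotlib's scatter accepts per-point colour and size arrays, so something along these lines should also work (a sketch, reusing M, cmap and sizes from the code above):
labels = M[:,2].astype(int)          # 1..N cluster label per point
plt.scatter(M[:,0], M[:,1],
            c=cmap[labels - 1],      # per-point RGB colour
            s=(sizes[labels - 1])**2)  # scatter sizes are areas in points^2
plt.show()
Note that with this approach the legend entries have to be built manually, so the loop version above is often simpler when labels are needed.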