I'm simulating a 2-dimensional random walk with step direction 0 < θ < 2π and T = 1000 steps. I already have code that simulates a single walk, repeats it 12 times, and saves each run to a sequentially named text file:
import math
import numpy as np
import numpy.random as rd

a = np.zeros((1000, 2), dtype=float)
print a  # Prints array with zeros as entries

# Single random walk
def randwalk(x, y):  # Defines the randwalk function
    theta = 2*math.pi*rd.rand()
    x += math.cos(theta)
    y += math.sin(theta)
    return (x, y)  # Function returns new (x, y) coordinates

x, y = 0., 0.  # Starting point is the origin
for i in range(1000):  # Walk contains 1000 steps
    x, y = randwalk(x, y)
    a[i, :] = x, y  # Replaces entries of a with (x, y) coordinates
# Repeating the random walk 12 times
fn_base = "random_walk_%i.txt"  # Saves each run to a sequentially named .txt
for j in range(12):
    rd.seed()  # Uses a different random seed for every run
    x, y = 0., 0.
    for i in range(1000):
        x, y = randwalk(x, y)
        a[i, :] = x, y
    fn = fn_base % j  # Builds the numbered file name
    np.savetxt(fn, a)  # Saves the run data to the corresponding text file
Now I want to calculate the mean square displacement over all 12 walks. To do this, my initial thought was to import the data from each text file back into a numpy array, e.g.:
infile="random_walk_0.txt"
rw0dat=np.genfromtxt(infile)
print rw0dat
And then somehow manipulate the arrays to find the mean square displacement.
Is there a more efficient way to go about finding the MSD with what I have?
Here is a quick snippet to compute the mean square displacement (MSD), where path is made of points equally spaced in time, as seems to be the case for your randwalk. You can place it inside the 12-walk for loop and compute it for each run's array a.
# input: path = [[x1, y1], ..., [xn, yn]] as a numpy array
def compute_MSD(path):
    totalsize = len(path)
    msd = []
    for i in range(totalsize-1):
        j = i+1
        msd.append(np.sum((path[0:-j]-path[j::])**2)/float(totalsize-j))
    msd = np.array(msd)
    return msd
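If it helps, here is a minimal sketch of wiring this into the 12-walk loop from the question (the msd_per_walk list and mean_msd name are just for illustration):

msd_per_walk = []                              # one MSD curve per run
for j in range(12):
    x, y = 0., 0.
    for i in range(1000):
        x, y = randwalk(x, y)
        a[i, :] = x, y
    msd_per_walk.append(compute_MSD(a))        # MSD curve for this run
mean_msd = np.mean(msd_per_walk, axis=0)       # average MSD over the 12 walks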
First, you don't actually need to store the whole 1000-step walk, just the final position.
Also, there's no reason to write them out to text files and load them back; you can just use them in memory, either in a list of arrays or in an array with one more dimension. Even if you do need to write them out, you can do that in addition to keeping the final values, rather than instead of keeping them. (Also, if you're not actually using numpy for performance or for simplicity in building the 2D array, you might want to consider writing the files iteratively, e.g., with the csv module, but that one's more of a judgment call.)
At any rate, given your 12 final positions, you just calculate the distance of each one from (0, 0), square it, sum them all, and divide by 12. (Or, since the obvious way to compute the distance from (0, 0) is to add the squares of the x and y positions and take the square root of the result, just skip the square root and the subsequent squaring.)
But if you want to store each whole walk into a file for some reason, then after you load them back in, walk[-1] gives you the final position as a 1D array of 2 values. So, you can either read those 12 final positions into a 12x2 array and vectorize the mean square distance, or just accumulate them in a list and do it manually.
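As a rough sketch of that file-based route (reusing the fn_base naming pattern from the question; finals is just an illustrative name):

finals = np.zeros((12, 2))                   # one (x, y) final position per walk
for j in range(12):
    walk = np.genfromtxt("random_walk_%i.txt" % j)
    finals[j] = walk[-1]                     # last row is the final position
mean_square_distance = np.mean(finals[:, 0]**2 + finals[:, 1]**2)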
While we're at it, the rd.seed() isn't necessary; the whole point of a PRNG is that you continue to get different numbers unless you explicitly reset the seed to its original value to repeat them.
Here's an example of dropping the two extra complexities and doing everything directly:
destinations = np.zeros((12, 2), dtype=float)
for j in range(12):
    x, y = 0., 0.
    for i in range(1000):
        x, y = randwalk(x, y)
    destinations[j] = x, y

square_distances = destinations[:, 0] ** 2 + destinations[:, 1] ** 2
mean_square_distance = np.mean(square_distances)
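And if you do still want the text files, here is a sketch of the same loop that writes each walk out while also keeping the final positions (reusing a and fn_base from the question):

a = np.zeros((1000, 2))
destinations = np.zeros((12, 2))
fn_base = "random_walk_%i.txt"
for j in range(12):
    x, y = 0., 0.
    for i in range(1000):
        x, y = randwalk(x, y)
        a[i, :] = x, y
    np.savetxt(fn_base % j, a)    # still write the full walk to disk
    destinations[j] = a[-1]       # and keep the final position in memory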
A square box of size 10,000×10,000 has 1,000,000 particles distributed uniformly. The box is divided into grid cells, each of size 100×100, so there are 10,000 cells in total. At every time step (for a total of 2016 steps), I would like to identify the grid cell to which each particle belongs. Is there an efficient way to implement this in Python? My implementation is below and currently takes approximately 83 s for one run.
import numpy as np
import time

start = time.time()

# Size of the layout
Layout = np.array([0, 10000])
# Total number of particles
Population = 1000000
# Array to hold the cell number of each particle
cell_number = np.zeros((Population), dtype=np.int32)
# Limits of each cell
boundaries = np.arange(0, 10100, step=100)
cell_boundaries = np.dstack((boundaries[0:100], boundaries[1:101]))
# Positions of the particles
points = np.random.uniform(0, Layout[1], size=(Population, 2))

# Generating a list with the x, y boundaries of each cell in the grid
x = []
limit_list = cell_boundaries
for i in range(0, Layout[1]//100):
    for j in range(0, Layout[1]//100):
        x.append([limit_list[0][i, 0], limit_list[0][i, 1], limit_list[0][j, 0], limit_list[0][j, 1]])

# Identifying the cell to which each particle belongs
i = 0
for y in x:
    cell_number[(points[:, 1] > y[0]) & (points[:, 1] < y[1]) & (points[:, 0] > y[2]) & (points[:, 0] < y[3])] = i
    i += 1
print(time.time() - start)
I am not sure about your code. You seem to be incrementing the i variable globally, while it should presumably be accumulated on a per-cell basis, correct? Something like cell_number[???] += 1, maybe?
Anyhow, the way I see it is from a different perspective. You could start by assigning each point a cell id, then invert the resulting array with a kind of counter function. I have implemented the following in PyTorch; you will most likely find equivalent utilities in NumPy.
The conversion from 2D point coordinates to cell ids corresponds to applying floor to the coordinates and then unfolding them according to your grid's width.
>>> p = torch.from_numpy(points).floor()
>>> p_unfold = p[:, 0]*10000 + p[:, 1]
Then you can "invert" the statistics, i.e. find out how many particles there are in each respective cell based on the cell ids. This can be done with PyTorch's histogram function torch.histc:
>>> torch.histc(p_unfold, bins=Population)
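For reference, a rough NumPy equivalent of the same idea, assuming the cells are the 100×100 squares from the question (so 100 cells per side; ix, iy, cell_id and counts are illustrative names):

ix = (points[:, 0] // 100).astype(np.int64)       # column index of each particle, 0..99
iy = (points[:, 1] // 100).astype(np.int64)       # row index of each particle, 0..99
cell_id = ix * 100 + iy                           # flattened cell id, 0..9999
counts = np.bincount(cell_id, minlength=100*100)  # number of particles in each cell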
I am asking for help to speed up my program by replacing the for loop described below with something using numpy and without a for loop. It involves calculating the distance from a single point to every point in a 2D array holding multiple points. I have added code and comments below to make it clearer.
Thank you very much for your help.
# some random point
current_point = np.array([423, 629])
distances = []

# x and y coordinates of the points, saved in np arrays of shape (1, number of points)
# example: for three points the xi array could look like: [231, 521, 24]
xi = x[indices]
yi = y[indices]

# combined array holding both x and y coordinates, of shape (number of points, 2)
x_y = np.vstack((xi, yi)).reshape(-1, 2)

# this is the part that I need to speed up: calculating the distance from
# current_point to every point in x_y
for i in range(xi.shape[0]):
    b = np.array([xi[i], yi[i]])
    distance = np.linalg.norm(current_point - b)
    distances.append(distance)
min_distance = min(distances)
Replace your for loop with this one-liner:
distances = np.linalg.norm(x_y - current_point, axis=1)
You're not properly stacking the coordinate array; it should be transposed, not reshaped:
x_y = np.vstack((xi, yi)).T
Then to find the distances:
distances = np.linalg.norm(current_point - x_y, axis=-1)
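From there the minimum (and, if useful, which point attains it) follows directly, restating the question's min_distance in vectorized form:

min_distance = distances.min()      # smallest distance to current_point
nearest_index = distances.argmin()  # index of the closest point in x_y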
There's some context to this, so bear with me please.
I have a list of lists, call it nested_lists, where each list is of the form [[1,2,3,...], [4,3,1,...]] (i.e. each list contains two lists of integers). Now, in each of these lists, the two lists of integers have the same length and two integers corresponding to the same index represent a coordinate in R^2.
So for example, (1,4) would be one coordinate from the above example.
Now, my task is to draw 5 unique coordinates from nested_lists uniformly (i.e. each coordinate has the same probability of being chosen), without replacement. That is, from all of the coordinates from the lists in nested_lists, I am trying to draw 5 unique coordinates uniformly without replacement.
One very straightforward way to do this would be to: 1. Create a list of ALL the unique coordinates in nested_lists. 2. Use numpy.random.choice to sample 5 elements uniformly without replacement.
The code would be something like this:
import numpy as np

coordinates = []
# Get a list of all unique coordinates
for lst in nested_lists:
    l = len(lst[0])
    for i in range(0, l):
        coordinate = (lst[0][i], lst[1][i])
        if coordinate not in coordinates:
            coordinates += [coordinate]

draws = np.random.choice(coordinates, 5, replace=False, p=[1/len(coordinates)]*len(coordinates))
But getting a set of all the unique coordinates can be very computationally expensive, especially if nested_lists contains millions of lists, each with thousands of coordinates in them. So I'm looking for methods to perform the same draws without having to get a list of all the coordinates first.
One method I thought of would be to sample with weighted probabilities from each list in nested_lists.
So get a list of the sizes (number of coordinates) of each list, and then go through each list and draw a coordinate with probability (size/sum(sizes))*(1/sum(sizes)). Repeating the process until 5 unique coordinates are drawn should then correspond to what we wanted to draw. The code would be something like this:
no_coordinates = lambda x: len(x[0])
sizes = list(map(no_coordinates, nested_lists))
i = 0
sum_sizes = sum(sizes)
draws = []
while i != 5:  # to make sure we get 5 draws
    for lst in nested_lists:
        size = len(lst[0])
        p = size/(sum_sizes**2)
        for j in range(0, size):
            if i >= 5:  # exit the for loop when we reach 5 draws
                break
            if np.random.random() < p and (lst[0][j], lst[1][j]) not in draws:
                draws.append((lst[0][j], lst[1][j]))
                i += 1
The code above seems to be more computationally efficient, but I am not sure if it actually draws with the same probability that would be required overall. From my calculation, the overall probability would be sum(sizes)/sum_sizes**2, which is the same as 1/sum_sizes (our required probability), but again, I'm not sure if this is correct.
So I was wondering if there are more efficient approaches to drawing like I want, and if my approach is actually correct or not.
You can use bootstrapping. Basically, the idea is to draw some large (but fixed) number of coordinates with replacement to estimate the probability of each coordinate. Then, you can subsample from this list using the transformed densities.
import random
from collections import Counter

bootstrap_sample_size = 1000
total_lists = len(nested_lists)
list_len = len(nested_lists[0][0])  # assumes every list holds the same number of coordinates

# a set would make more sense in this example;
# I used a Counter to allow for future statistical manipulations
c = Counter()
for _ in range(bootstrap_sample_size):
    x, y = random.randrange(total_lists), random.randrange(list_len)
    random_point = nested_lists[x][0][y], nested_lists[x][1][y]
    c.update((random_point,))

# now c contains counts for 1000 points drawn with replacement
# let's just ignore these probabilities to get a uniform sample
result = random.sample(list(c.keys()), 5)
This will not be exactly uniform, but the bootstrap provides statistical guarantees that it will be arbitrarily close to a uniform distribution as bootstrap_sample_size is increased. 1000 samples is usually enough for most real-life applications.
This question is somewhat similar to this. I've gone a bit farther than the OP, though, and I'm in Python 2 (not sure what he was using).
I have a Python function that can determine the distance from a point inside a convex polygon to regularly-defined intervals along the polygon's perimeter. The problem is that it returns "extra" distances that I need to eliminate. (Please note--I suspect this will not work for rectangles yet. I'm not finished with it.)
First, the code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
#  t1.py
#
#  Copyright 2015 FRED <fred#matthew24-25>
#
#  THIS IS TESTING CODE ONLY. IT WILL BE MOVED INTO THE CORRECT MODULE
#  UPON COMPLETION.
#
from __future__ import division
import math
import matplotlib.pyplot as plt

def Dist(center_point, Pairs, deg_Increment):
    # I want to set empty lists to store the values of m_lnsgmnt and b_lnsgmnt
    # for every iteration of the for loop.
    m_linesegments = []
    b_linesegments = []

    # Scream and die if Pairs[0] is the same as the last element of Pairs--i.e.
    # it has already been run once.
    #if Pairs[0] == Pairs[len(Pairs)-1]:
        ##print "The vertices contain duplicate points!"

    ## Creates a new list containing the original list plus the first element. I did this because, due
    ## to the way the for loop is set up, the last iteration of the loop subtracts the value of the
    ## last value of Pairs from the first value. I therefore duplicated the first value.
    #elif:
    new_Pairs = Pairs + [Pairs[0]]

    # This will calculate the slopes and y-intercepts of the line segments of the polygon.
    for a in range(len(Pairs)):
        # This calculates the slope of each line segment and appends it to m_linesegments.
        m_lnsgmnt = (new_Pairs[a+1][1] - new_Pairs[a][1]) / (new_Pairs[a+1][0] - new_Pairs[a][0])
        m_linesegments.append(m_lnsgmnt)
        # This calculates the y-intercept of each line segment and appends it to b_linesegments.
        b_lnsgmnt = (Pairs[a][1]) - (m_lnsgmnt * Pairs[a][0])
        b_linesegments.append(b_lnsgmnt)

    # These are temporary testing lines.
    print "m_linesegments =", m_linesegments
    print "b_linesegments =", b_linesegments

    # I want to set empty lists to store the values of m_rys and b_rys for every
    # iteration of the for loop.
    m_rays = []
    b_rays = []

    # I need to set a range of degrees the intercepts will be calculated for.
    theta = range(0, 360, deg_Increment)

    # Temporary testing line.
    print "theta =", theta

    # Calculate the slopes and y-intercepts of the rays radiating from the center_point.
    for b in range(len(theta)):
        m_rys = math.tan(math.radians(theta[b]))
        m_rays.append(m_rys)
        b_rys = center_point[1] - (m_rys * center_point[0])
        b_rays.append(b_rys)

    # Temporary testing lines.
    print "m_rays =", m_rays
    print "b_rays =", b_rays

    # Set empty lists for the intercepts.
    Intercepts = []
    angle = []

    # Calculate the intersections of the rays with the line segments.
    for c in range((360//deg_Increment)):
        for d in range(len(Pairs)):
            # Calculate the x-coordinate and the y-coordinate of each
            # intersection.
            x_Int = (b_rays[c] - b_linesegments[d]) / (m_linesegments[d] - m_rays[c])
            y_Int = ((m_linesegments[d] * x_Int) + b_linesegments[d])
            Intercepts.append((x_Int, y_Int))

            # Calculates the angle of the ray. Rounding is necessary to
            # compensate for binary-decimal errors.
            a_ngle = round(math.degrees(math.atan2((y_Int - center_point[1]), (x_Int - center_point[0]))))
            # Substitutes the positive equivalent for every negative angle,
            # i.e. -270 degrees equals 90 degrees.
            if a_ngle < 0:
                a_ngle = a_ngle + 360
            # Selects the angles that correspond to theta.
            if a_ngle == theta[c]:
                angle.append(a_ngle)

    print "INT1=", Intercepts
    print "angle=", angle

    dist = []
    # Calculates the distances.
    for e in range(len(Intercepts) - 1):
        distA = math.sqrt(((Intercepts[e][0] - center_point[0])**2) + ((Intercepts[e][1] - center_point[1])**2))
        dist.append(distA)

    print "dist=", dist

if __name__ == "__main__":
    main()
Now, as to how it works:
The code takes 3 inputs: center_point (a point contained in the polygon, given in (x, y) coordinates), Pairs (the vertices of the polygon, also given in (x, y) coordinates), and deg_Increment (which defines how often to calculate the distance).
Let's assume that center_point = (4, 5), Pairs = [(1, 4), (3, 8), (7, 2)], and deg_Increment = 20. This means that a polygon is created (sort of) whose vertices are Pairs, and center_point is a point contained inside the polygon.
Now rays are set to radiate from center_point every 20 degrees (which is deg_Increment). The intersection points of the rays with the perimeter of the polygon are determined, and the distance is calculated using the distance formula.
The only problem is that I'm getting too many distances. :( In my example above, the correct distances are
1.00000 0.85638 0.83712 0.92820 1.20455 2.07086 2.67949 2.29898 2.25083 2.50000 3.05227 2.22683 1.93669 1.91811 2.15767 2.85976 2.96279 1.40513
But my code is returning
dist= [2.5, 1.0, 6.000000000000001, 3.2523178818773006, 0.8563799085248148, 3.0522653889161626, 5.622391569468206, 0.8371216462519347, 2.226834844885431, 37.320508075688686, 0.9282032302755089, 1.9366857335569072, 7.8429970322236064, 1.2045483557883576, 1.9181147622136665, 3.753460385470896, 2.070863609380179, 2.157671808913309, 2.6794919243112276, 12.92820323027545, 2.85976265663383, 2.298981118867903, 2.962792920643178, 5.162096782237789, 2.250827351906659, 1.4051274947736863, 69.47032761621092, 2.4999999999999996, 1.0, 6.000000000000004, 3.2523178818773006, 0.8563799085248148, 3.0522653889161626, 5.622391569468206, 0.8371216462519347, 2.226834844885431, 37.32050807568848, 0.9282032302755087, 1.9366857335569074, 7.842997032223602, 1.2045483557883576, 1.9181147622136665, 3.7534603854708997, 2.0708636093801767, 2.1576718089133085, 2.679491924311227, 12.928203230275532, 2.85976265663383, 2.298981118867903, 2.9627929206431776, 5.162096782237789, 2.250827351906659, 1.4051274947736847]
If anyone can help me get only the correct distances, I'd greatly appreciate it.
Thanks!
And just for reference, here's what my example looks like with the correct distances only:
You're getting too many values in Intercepts because it's being appended to inside the second for-loop [for d in range(len(Pairs))].
You only want one value in Intercepts per step through the outer for-loop [for c in range((360//deg_Increment))], so the append to Intercepts needs to happen in that outer loop, not in the inner one.
I'm not sure what you're doing with the inner loop, but you seem to be calculating a separate intercept for each of the lines that make up the polygon sides. But you only want the one that you're going to hit "first" when going in that direction.
You'll have to add some code to figure out which of the 3 (in this case) sides of the polygon you're actually going to encounter first.
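One possible way to do that is sketched below. It is not a drop-in fix for the code above: it uses a direction-vector form of the rays instead of slopes and intercepts (which also avoids the vertical-ray problem), and ray_polygon_distance is just an illustrative name.

import math

def ray_polygon_distance(center_point, Pairs, deg_Increment):
    # For each ray angle, return the distance to the first polygon side hit (None if no hit).
    cx, cy = center_point
    verts = Pairs + [Pairs[0]]            # close the polygon
    distances = []
    for theta in range(0, 360, deg_Increment):
        dx, dy = math.cos(math.radians(theta)), math.sin(math.radians(theta))
        best = None
        for (x1, y1), (x2, y2) in zip(verts[:-1], verts[1:]):
            ex, ey = x2 - x1, y2 - y1     # edge direction
            denom = dx * ey - dy * ex
            if abs(denom) < 1e-12:        # ray parallel to this edge
                continue
            # solve center + t*(dx, dy) == (x1, y1) + u*(ex, ey)
            t = ((x1 - cx) * ey - (y1 - cy) * ex) / denom
            u = ((x1 - cx) * dy - (y1 - cy) * dx) / denom
            if t > 0 and 0 <= u <= 1:     # in front of the ray and inside the segment
                best = t if best is None else min(best, t)
        distances.append(best)
    return distances

With center_point = (4, 5), Pairs = [(1, 4), (3, 8), (7, 2)] and deg_Increment = 20, the first two values this returns are 1.0 and roughly 0.856, which matches the start of the expected list above.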
Does anyone know a good method to calculate the empirical/sample covariogram, if possible in Python?
This is a screenshot of a book which contains a good definition of the covariogram:
If I understood it correctly, for a given lag/width h, I'm supposed to get all the pairs of points that are separated by h (or less than h), multiply their values, and for each of these points calculate its mean, which in this case is defined as m(x_i). However, according to the definition of m(x_i), if I want to compute m(x_1), I need to obtain the average of the values located within distance h from x_1. This looks like a very intensive computation.
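Written out, my reading of the definition (which is also what my attempt below tries to compute) is roughly

C(h) = (1/|N(h)|) * sum over pairs (i, j) in N(h) of [ z(x_i)*z(x_j) - m(x_i)*m(x_j) ]

where N(h) is the set of point pairs separated by (roughly) h, and m(x_i) is the average of the z-values of the points lying within that distance band of x_i.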
First of all, am I understanding this correctly? If so, what is a good way to compute this in a two-dimensional space? I tried to code this in Python (using numpy and pandas), but it takes a couple of seconds and I'm not even sure it is correct, which is why I will refrain from posting that code here. Here is another, very naive attempt:
import numpy as np
from scipy.spatial.distance import pdist, squareform

distances = squareform(pdist(np.array(coordinates)))  # coordinates is a nx2 array
z = np.array(z)                      # z are the values
cutoff = np.max(distances)/3.0       # somewhat arbitrary cutoff
width = cutoff/15.0
widths = np.arange(0, cutoff + width, width)
Z = []
Cov = []

for w in np.arange(len(widths)-1):   # for each width
    # for each pairwise distance
    for i in np.arange(distances.shape[0]):
        for j in np.arange(distances.shape[1]):
            if distances[i, j] <= widths[w+1] and distances[i, j] > widths[w]:
                m1 = []
                m2 = []
                # when a distance is within a given width, calculate the means of
                # the points involved
                for x in np.arange(distances.shape[1]):
                    if distances[i, x] <= widths[w+1] and distances[i, x] > widths[w]:
                        m1.append(z[x])
                for y in np.arange(distances.shape[1]):
                    if distances[j, y] <= widths[w+1] and distances[j, y] > widths[w]:
                        m2.append(z[y])
                mean_m1 = np.array(m1).mean()
                mean_m2 = np.array(m2).mean()
                Z.append(z[i]*z[j] - mean_m1*mean_m2)
    Z_mean = np.array(Z).mean()      # calculate covariogram for width w
    Cov.append(Z_mean)               # collect covariances for all widths
However, now I have confirmed that there is an error in my code. I know that because I used the variogram to calculate the covariogram (covariogram(h) = covariogram(0) - variogram(h)) and I get a different plot:
And it is supposed to look like this:
Finally, if you know a Python/R/MATLAB library to calculate empirical covariograms, let me know. At least, that way I can verify what I did.
One could use scipy.cov, but if one does the calculation directly (which is very easy), there are more ways to speed this up.
First, make some fake data that has some spacial correlations. I'll do this by first making the spatial correlations, and then using random data points that are generated using this, where the data is positioned according to the underlying map, and also takes on the values of the underlying map.
Edit 1:
I changed the data point generator so positions are purely random, but z-values are proportional to the spatial map. And I changed the map so that the left and right sides are shifted relative to each other, to create negative correlation at large h.
from numpy import *
import math
import random
import matplotlib.pyplot as plt

S = 1000
N = 900

# first, make some fake data, with correlations on two spatial scales
# density map
x = linspace(0, 2*pi, S)
sx = sin(3*x)*sin(10*x)
density = .8 * abs(outer(sx, sx))
density[:, :S//2] += .2

# make a point cloud motivated by this density
random.seed(10)  # so this can be repeated
points = []
while len(points) < N:
    v, ix, iy = random.random(), random.randint(0, S-1), random.randint(0, S-1)
    if True:  # v < density[ix, iy]:
        points.append([ix, iy, density[ix, iy]])
locations = array(points).transpose()
print locations.shape

plt.imshow(density, alpha=.3, origin='lower')
plt.plot(locations[1, :], locations[0, :], '.k')
plt.xlim((0, S))
plt.ylim((0, S))
plt.show()

# build these into the main data: all pairs into distances and z0 z1 values
L = locations
m = array([[math.sqrt((L[0, i]-L[0, j])**2 + (L[1, i]-L[1, j])**2), L[2, i], L[2, j]]
           for i in range(N) for j in range(N) if i > j])
Which gives:
The above is just the simulated data, and I made no attempt to optimize its production, etc. I assume this is where the OP starts, with the task below, since the data already exists in a real situation.
Now calculate the "covariogram" (which is much easier than generating the fake data, btw). The idea here is to sort all the pairs and associated values by h, and then index into these using ihvals. That is, summing up to index ihval is the sum over N(h) in the equation, since this includes all pairs with hs below the desired values.
Edit 2:
As suggested in the comments below, N(h) is now only the pairs that are between h-dh and h, rather than all pairs between 0 and h (where dh is the spacing of the h-values in hvals, i.e., roughly S/1000 below).
# now do the real calculations for the covariogram
# sort by h and give clear names
i = argsort(m[:, 0])  # h sorting
h = m[i, 0]
zh = m[i, 1]
zsh = m[i, 2]
zz = zh*zsh

hvals = linspace(0, S, 1000)  # the values of h to use (S should be in units of distance; here I just used ints)
ihvals = searchsorted(h, hvals)

result = []
for i, ihval in enumerate(ihvals[1:]):
    start, stop = ihvals[i], ihval   # pairs with h between hvals[i] and hvals[i+1]
    N = stop - start
    if N > 0:
        mnh = sum(zh[start:stop])/N
        mph = sum(zsh[start:stop])/N
        szz = sum(zz[start:stop])/N
        C = szz - mnh*mph
        result.append([h[ihval], C])
result = array(result)

plt.plot(result[:, 0], result[:, 1])
plt.grid()
plt.show()
which looks reasonable to me, as one can see bumps or troughs at the expected h values, but I haven't done a careful check.
The main speedup here over scipy.cov is that one can precalculate all of the products, zz. Otherwise, one would feed zh and zsh into cov for every new h, and all the products would be recalculated. This calculation could be sped up even more by doing partial sums, i.e., from ihvals[n-1] to ihvals[n] at each step n, but I doubt that will be necessary.
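If it ever did become necessary, here is a sketch of that partial-sum idea, reusing zh, zsh, zz, h, and ihvals from above: precompute cumulative sums once, so each bin's sums are just differences of running totals.

# cumulative sums with a leading zero, so the sum over [start:stop) is cs[stop] - cs[start]
czh = concatenate(([0.], cumsum(zh)))
czsh = concatenate(([0.], cumsum(zsh)))
czz = concatenate(([0.], cumsum(zz)))
result = []
for start, stop in zip(ihvals[:-1], ihvals[1:]):
    N = stop - start
    if N > 0:
        mnh = (czh[stop] - czh[start])/N
        mph = (czsh[stop] - czsh[start])/N
        szz = (czz[stop] - czz[start])/N
        result.append([h[stop-1], szz - mnh*mph])
result = array(result)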