Random walks and diffusive behaviour in Python

The signature of diffusive behaviour is that the average of the square of the distance of the walker from the origin after t steps is proportional to the number of steps.
Repeat the process for several walkers and step sizes (the text suggests 500 walkers and up to 100 steps), and plot a graph similar to the one in the textbook to confirm this. Confirm also that the slope of such a graph is 1 as predicted.
I implemented the following code in Python, but each time I run it the gradient I get is half the expected value, and I cannot find the mistake. (The post showed two images here: the desired graph and my graph.) Can anyone spot what's wrong with the code?
import numpy as np
import matplotlib.pyplot as plt
for j in range(nwalks):
for i in range(1,nsteps):
if rnd<0.5:
y=np.mean(array, axis=0)
#generating a function of the form y=mx + c
func = np.poly1d(coeffs)
# Getting the trendline(y values)
trendline = func(i)
plt.plot(i,trendline, 'k')

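I have the impression that your line x2avg[i]=(np.sum(x2))/(i+1) is wrong. It calculates the running mean of the squared distance over all steps so far of a single walker, not the mean over walkers at step i. Since the expected squared distance grows linearly with the step number, averaging it over steps 0 to i gives about i/2, which is why the gradient comes out at half the expected value.
Remove the x2 and x2avg arrays and use array[j, i] = x[i]**2 in every inner loop: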
for j in range(nwalks):
for i in range(1,nsteps):
if rnd<0.5:
You do already calculate the mean in the very next line, which is the correct way:
y=np.mean(array, axis=0)
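For reference, here is a minimal vectorised sketch of the whole experiment. It assumes unit steps of +1 or -1 with equal probability; the function name `mean_squared_displacement` and the seed are illustrative and not taken from the original post.

```python
import numpy as np

def mean_squared_displacement(nwalks=500, nsteps=100, seed=0):
    """Return <x^2> after t = 0 .. nsteps steps, averaged over nwalks walkers."""
    rng = np.random.default_rng(seed)
    # Each step is +1 or -1 with probability 1/2
    steps = rng.choice([-1, 1], size=(nwalks, nsteps))
    # Position of every walker after each step, with x = 0 at t = 0
    start = np.zeros((nwalks, 1))
    x = np.concatenate([start, np.cumsum(steps, axis=1)], axis=1)
    # Average the squared position over walkers only, not over steps
    return np.mean(x**2, axis=0)

t = np.arange(101)
msd = mean_squared_displacement()
slope, intercept = np.polyfit(t, msd, 1)
print(f"fitted slope: {slope:.2f}")
```

With only 500 walkers the fitted slope fluctuates around 1 from run to run; increasing `nwalks` should tighten it.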


Integration of a function with discrete values

I want to do an integration without knowing the functional equation f(x). I also have only discrete values, which Python has connected in a plot. This one looks like this:
This is the code with the calculation for it:
import numpy as np
import matplotlib.pyplot as plt
import math as m
import loaddataa as ld
# Loading of the values
dataListStride = ld.loadData("../Projektpraktikum Binder/Data/1 Fabienne/Test1/left foot/50cm")
indexStrideData = 0
strideData = dataListStride[indexStrideData]
#%%Calculation of the horizontal acceleration
def horizontal(yAngle, yAcceleration, xAcceleration):
    a = ((m.cos(m.radians(yAngle)))*yAcceleration)-((m.sin(m.radians(yAngle)))*xAcceleration)
    return a
resultsHorizontal = list()
for i in range (len(strideData)):
    strideData_yAngle = strideData.to_numpy()[i, 2]
    strideData_xAcceleration = strideData.to_numpy()[i, 4]
    strideData_yAcceleration = strideData.to_numpy()[i, 5]
    resultsHorizontal.append(horizontal(strideData_yAngle, strideData_yAcceleration, strideData_xAcceleration))
resultsHorizontal.insert(0, 0)
print("The values are: " +str(resultsHorizontal))
print("There are " +str(len(resultsHorizontal)) + " values.")
#x-axis "convert" into time: 100 Hertz makes 0.01 seconds
scale_factor = 0.01
x_values = np.arange(len(resultsHorizontal)) * scale_factor
plt.plot(x_values, resultsHorizontal)
After the calculation I get a list of these values (which were shown and plotted in the diagram above):
Note about the code:
The code works as follows: using loaddataa.py, a CSV file is read in. Then the formula for the horizontal acceleration is defined in def horizontal(yAngle, yAcceleration, xAcceleration). In the for loop, the previously loaded data is run through row by row. Columns 2, 4 and 5 of the CSV file are used here. Then a 0 is added to the beginning of the resulting list of values. This is important to perform the integration from 0 to the end.
Now I want to integrate this function (which is shown in the plot at the top) using these values (which can be seen in the image after the code).
Is there a way to implement this? If so, how, and what would the plot look like? Maybe this can be done with a trapezoidal integration? Thanks for helping me!
At the end of my task I want to do a double integration with the acceleration values to get the path length. The first (trapezoidal) integration of the acceleration should represent the velocity and the second (trapezoidal) integration the path length (location). The x-axis should remain as it is.
What I just noticed are the negative values. Theoretically the integration should always result in positive values, right? Because there are no negative areas.
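One possible way to do this is a running trapezoidal integral applied twice. The sketch below assumes equally spaced samples (0.01 s, as in the question) and zero initial velocity and position. The helper name `cumulative_trapezoid` and the acceleration values are made up for illustration; `resultsHorizontal` would take their place. Note that samples with negative acceleration contribute negative signed area, so the integrated curve can go down as well as up.

```python
import numpy as np

def cumulative_trapezoid(values, dt):
    """Running trapezoidal integral of equally spaced samples, starting at 0."""
    values = np.asarray(values, dtype=float)
    # Area of each trapezoid between neighbouring samples
    areas = 0.5 * (values[1:] + values[:-1]) * dt
    return np.concatenate([[0.0], np.cumsum(areas)])

dt = 0.01  # 100 Hz sampling, as in the question
# Made-up acceleration samples, standing in for resultsHorizontal
acceleration = np.array([0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5, 0.0])
velocity = cumulative_trapezoid(acceleration, dt)  # first integration
position = cumulative_trapezoid(velocity, dt)      # second integration
time = np.arange(len(acceleration)) * dt           # x-axis stays the same
```

Each result has the same length as the input, so `time` can be reused for plotting all three curves. SciPy's `scipy.integrate.cumulative_trapezoid` with `initial=0` computes the same running integral.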

Fit a line segment to a set of points

I'm trying to fit a line segment to a set of points but I have trouble finding an algorithm for it. I have a 2D line segment L and a set of 2D points C. L can be represented in any suitable way (I don't care), like support and definition vector, two points, a linear equation with left and right bound, ... The only important thing is that the line has a beginning and an end, so it's not infinite.
I want to fit L to C so that the sum of all distances of c to L (where c is a point in C) is minimized. This is a least squares problem, but I think I cannot use polynomial fitting because L is only a segment. My mathematical knowledge in that area is a bit lacking, so any hints on further reading would be appreciated as well.
Here is an illustration of my problem:
The orange line should be fitted to the blue points so that the sum of squares of distances of each point to the line is minimal. I don't mind if the solution is in a different language or not code at all, as long as I can extract an algorithm from it.
Since this is more of a mathematical question I'm not sure if it's ok for SO or should be moved to cross validated or math exchange.
This solution is relatively similar to one already posted here, but I think is slightly more efficient, elegant and understandable, which is why I posted it despite the similarity.
As was already written, the min(max(...)) formulation makes it hard to solve this problem analytically, which is why scipy.optimize fits well.
The solution is based on the mathematical formulation for distance between a point and a finite line segment outlined in https://math.stackexchange.com/questions/330269/the-distance-from-a-point-to-a-line-segment
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import minimize, NonlinearConstraint
def calc_distance_from_point_set(v_):
    #v_ is accepted as 1d array to make easier with scipy.optimize
    #Reshape into two points
    v = (v_[:2].reshape(2, 1), v_[2:].reshape(2, 1))
    #Calculate t* for s(t*) = v_0 + t*(v_1-v_0), for the line segment w.r.t each point
    t_star_matrix = np.minimum(np.maximum(np.matmul(P-v[0].T, v[1]-v[0]) / np.linalg.norm(v[1]-v[0])**2, 0), 1)
    #Calculate s(t*)
    s_t_star_matrix = v[0]+((t_star_matrix.ravel())*(v[1]-v[0]))
    #Take distance between all points and respective point on segment
    distance_from_every_point = np.linalg.norm(P.T -s_t_star_matrix, axis=0)
    return np.sum(distance_from_every_point)
if __name__ == '__main__':
    #Random points from bounding box
    box_1 = np.random.uniform(-5, 5, 20)
    box_2 = np.random.uniform(-5, 5, 20)
    P = np.stack([box_1, box_2], axis=1)
    segment_length = 3
    segment_length_constraint = NonlinearConstraint(fun=lambda x: np.linalg.norm(np.array([x[0], x[1]]) - np.array([x[2] ,x[3]])), lb=[segment_length], ub=[segment_length])
    point = minimize(calc_distance_from_point_set, (0.0,-.0,1.0,1.0), options={'maxiter': 100, 'disp': True},constraints=segment_length_constraint).x
    plt.scatter(box_1, box_2)
    plt.plot([point[0], point[2]], [point[1], point[3]])
Example result:
Here is a proposition in python. The distance between the points and the line is computed based on the approach proposed here: Fit a line segment to a set of points
The segment has a finite length, which forces the use of min and max functions (or if tests to decide whether to use the perpendicular distance or the distance to one of the end points). This makes it really difficult (impossible?) to get an analytic solution.
The proposed solution will thus use optimization algorithm to approach the best solution. It uses scipy.optimize.minimize, see: https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html
Since the segment length is fixed, we have only three degrees of freedom. In the proposed solution I use x and y coordinate of the starting segment point and segment slope as free parameters. I use getCoordinates function to get starting and ending point of the segment from these 3 parameters and the length.
import numpy as np
from scipy.optimize import minimize
import matplotlib.pyplot as plt
import math as m
from scipy.spatial import distance
# Plot the points and the segment
def plotFunction(points,x1,x2):
'Plotting function for plane and iterations'
plt.xlim(0, 1)
plt.ylim(0, 1)
# Get the sum of the distance between all the points and the segment
# The segment is defined by guess and length were:
# guess[0]=x coordinate of the starting point
# guess[1]=y coordinate of the starting point
# guess[2]=slope
# Since distance is always >0 no need to use root mean square values
def getDist(guess,points,length):
# Loop over each points to get the distance between the point and the segment
for pt in points:
# Return minimum distance between line segment x1-x2 and point pt
# Adapted from https://stackoverflow.com/questions/849211/shortest-distance-between-a-point-and-a-line-segment
def minimum_distance(x1, x2, pt,length):
    length2 = length**2 # i.e. |x1-x2|^2 - avoid a sqrt, we use length that we already know to avoid re-computation
    if length2 == 0.0:
        # Degenerate segment: x1 and x2 coincide
        return distance.euclidean(pt, x1)
    # Consider the line extending the segment, parameterized as x1 + t (x2 - x1).
    # We find projection of point pt onto the line.
    # It falls where t = [(pt-x1) . (x2-x1)] / |x2-x1|^2
    # We clamp t from [0,1] to handle points outside the segment x1-x2.
    t = max(0, min(1, np.dot(pt - x1, x2 - x1) / length2))
    projection = x1 + t * (x2 - x1) # Projection falls on the segment
    return distance.euclidean(pt, projection)
# Get coordinates of start and end point of the segment from start_pt,
# slope and length, obtained by solving slope=dy/dx, dx^2+dy^2=length^2
def getCoordinates(start_pt,slope,length):
return [x1,x2]
if __name__ == '__main__':
# Generate random points
# Starting position
#Use scipy.optimize, minimize to find the best start_pt and slope combination
res = minimize(getDist, x0=[start_pt[0],start_pt[1],slope], args=(points,length), method="Nelder-Mead")
# Retrieve best parameters
print("\n** The best segment found is defined by:")
print("\t** start_pt:\t",x1)
print("\t** end_pt:\t",x2)
print("\t** slope:\t",slope)
print("** The total distance is:",getDist([x1[0],x1[1],slope],points,length),"\n")
# Plot results
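The body of getCoordinates is not shown above. A minimal sketch of what it could look like, together with the clamped point-to-segment distance, follows. It assumes the segment points towards increasing x, so a slope parameter cannot represent an exactly vertical segment. The names `get_coordinates`, `segment_distance` and `total_distance` are illustrative and not the answer's own code.

```python
import numpy as np

def get_coordinates(start_pt, slope, length):
    """End points of a segment of the given length starting at start_pt.

    Solves slope = dy/dx together with dx**2 + dy**2 = length**2,
    taking the solution with dx > 0 (an assumption of this sketch).
    """
    x1 = np.asarray(start_pt, dtype=float)
    dx = length / np.sqrt(1.0 + slope**2)
    dy = slope * dx
    return x1, x1 + np.array([dx, dy])

def segment_distance(x1, x2, pt):
    """Distance from pt to the closest point of the segment x1-x2."""
    direction = x2 - x1
    length2 = np.dot(direction, direction)
    if length2 == 0.0:
        return np.linalg.norm(pt - x1)
    # Clamp the projection parameter to [0, 1] so it stays on the segment
    t = np.clip(np.dot(pt - x1, direction) / length2, 0.0, 1.0)
    return np.linalg.norm(pt - (x1 + t * direction))

def total_distance(guess, points, length):
    """Objective: guess = [x_start, y_start, slope]."""
    x1, x2 = get_coordinates(guess[:2], guess[2], length)
    return sum(segment_distance(x1, x2, pt) for pt in points)
```

`total_distance` has the same argument order as getDist, so it could be passed to scipy.optimize.minimize with args=(points, length) in the same way.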

Is there a way to use a function to draw a square with a minimum size over the densest region of points?

The Problem:
Using NumPy, I have created an array of random points within a range.
import numpy as np
min_square = 5
positions = (np.random.random(size=(100, 2)) - 0.5) * 2 * container_radius
Where container_radius is an integer and min_square is an integer.
Following that, using matplotlib, I plot the points on a graph.
import matplotlib.pyplot as plt
plt.plot(positions[:, 0], positions[:, 1], 'r.')
This graph shows me the distribution of the points in relation to each other.
What I am looking for is a method to implement something similar to or exactly a k-d tree to draw a rectangle over the densest area of the scatter plot with a defined minimum for the size.
This would be done using plt.gca().add_patch(plt.Rectangle((x, y), width=square_side, height=square_side, fill=None)), where square_side is defined by the density function and is at least min_square.
Attempts to Solve the Problem:
So far, I have created my own sort of density function that is within my understanding of Python and easy enough to code without lagging my computer too hard.
The solution comes in the form of an additional predefined variable, intervals, which is an integer.
Using what I had so far, I define a function to calculate the densities by checking if the points are within a range of floats.
# clb stands for calculate_lower_bound
def clb(x):
    return -1 * container_radius + (x * 2 * container_radius - min_square) / (intervals - 1)
# crd stands for calculate_regional_density
def crd(x, y):
    return np.where(np.logical_and(
        np.logical_and(positions[:, 0] >= clb(x), positions[:, 0] < clb(x) + min_square),
        np.logical_and(positions[:, 1] >= clb(y), positions[:, 1] < clb(y) + min_square)))[0].shape[0]
Then, I create a NumPy array of size size=(intervals, intervals) and pass the indices of the array (I have another question about this as I am currently using a quite inefficient method) as inputs into crd(x,y) and store the values in another array called densities. Then using some method, I calculate the maximum value in my densities array and draw the rectangle using some pretty straightforward code that I do not think is necessary to include here as it is not the problem.
What I Am Looking For:
I am looking for some function, f(x), that computes the dimensions and coordinates of a square encompassing the densest region on a scatterplot graph. The function would have access to all the variables it needs such as positions, min_square, etc. If you could use informative variable names or explain what each variable means, that would be a great help as well.
Other (Potentially) Important Notes:
I am looking for something that gets the job done in a reasonable time. In most scenarios, I am going to be working with around 10000 points and I need to calculate the densest region around 100 times so the function needs to be efficient enough so that the task completes within around 10-20 seconds.
As such, approximations using formulas like the example I have shown are completely valid as long as they implement well and are able to grow the dimensions of the square larger if necessary.
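One approximate approach in the spirit of the question is to count points on a fine grid with np.histogram2d and then slide a window of min_square over the counts using a 2D cumulative sum, which avoids looping over window positions. The sketch below is only an illustration: the function name `densest_square` and the parameter `cells_per_side` are hypothetical, window positions are restricted to grid edges, and the square is kept at the fixed size min_square rather than grown.

```python
import numpy as np

def densest_square(positions, container_radius, min_square, cells_per_side=4):
    """Return (x, y, side, count) for the min_square window holding most points.

    Assumes 2 * container_radius >= min_square. (x, y) is the lower-left corner.
    """
    k = cells_per_side
    cell = min_square / k
    n_cells = int(np.ceil(2 * container_radius / cell))
    edges = -container_radius + cell * np.arange(n_cells + 1)
    # Point counts per grid cell; counts[i, j] is x-bin i, y-bin j
    counts, _, _ = np.histogram2d(positions[:, 0], positions[:, 1],
                                  bins=[edges, edges])
    # Summed-area table with a leading row and column of zeros
    integral = np.zeros((n_cells + 1, n_cells + 1))
    integral[1:, 1:] = counts.cumsum(axis=0).cumsum(axis=1)
    # window[i, j] = number of points in cells [i, i+k) x [j, j+k)
    window = (integral[k:, k:] - integral[:-k, k:]
              - integral[k:, :-k] + integral[:-k, :-k])
    ix, iy = np.unravel_index(np.argmax(window), window.shape)
    return edges[ix], edges[iy], min_square, int(window[ix, iy])

container_radius = 10
min_square = 5
positions = (np.random.random(size=(10000, 2)) - 0.5) * 2 * container_radius
x, y, side, count = densest_square(positions, container_radius, min_square)
```

The returned corner and side can be passed to plt.Rectangle((x, y), width=side, height=side, fill=None). A larger `cells_per_side` gives finer window positions at the cost of a bigger grid.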

Python fastKDE beyond limits of data points

I'm trying to use the fastKDE package (https://pypi.python.org/pypi/fastkde/1.0.8) to find the KDE of a point in a 2D plot. However, I want to know the KDE beyond the limits of the data points, and cannot figure out how to do this.
Using the code listed on the site linked above:
import numpy as np
from fastkde import fastKDE
import pylab as PP
#Generate two random variables dataset (representing 200000 pairs of datapoints)
N = int(2e5)  # the size argument must be an integer
var1 = 50*np.random.normal(size=N) + 0.1
var2 = 0.01*np.random.normal(size=N) - 300
#Do the self-consistent density estimate
myPDF,axes = fastKDE.pdf(var1,var2)
#Extract the axes from the axis list
v1,v2 = axes
#Plot contours of the PDF; these should be a set of concentric ellipsoids centered on
#(0.1, -300). Comparatively, the y axis range should be tiny and the x axis range
#should be large
I'm able to find the KDE for any point within the limits of the data, but how do I find the KDE for say the point (0,300), without having to include it into var1 and var2. I don't want the KDE to be calculated with this data point, I want to know the KDE at that point.
I guess what I really want to be able to do is give the fastKDE a histogram of the data, so that I can set its axes myself. I just don't know if this is possible?
I, too, have been experimenting with this code and have run into the same issues. What I've done (in lieu of a good N-D extrapolator) is to build a KDTree (with scipy.spatial) from the grid points that fastKDE returns and find the nearest grid point to the point I want to evaluate. I then look up the corresponding pdf value at that point (it should be small near the edge of the pdf grid, if not identically zero) and assign that value accordingly.
I came across this post while searching for a solution to this problem. Similar to building a KDTree, you can calculate the step size in every grid dimension and then get the index of your query point by subtracting the start of the axis from the point value, dividing by the step size of that dimension, rounding, and converting to an integer. For example, in 1D:
def fastkde_test(test_x):
    kde, axes = fastKDE.pdf(test_x, numPoints=num_p)
    # A uniform grid with len(axes) points has len(axes)-1 intervals
    x_step = (max(axes)-min(axes)) / (len(axes) - 1)
    x_ind = np.int32(np.round((test_x-min(axes)) / x_step))
    return kde[x_ind]
where test_x in this case is both the set for defining the KDE and the query set. Doing it this way is faster by a factor of about 10 in my case (at least in 1D; higher dimensions not yet tested) and does basically the same thing as the KDTree query.
I hope this helps anyone coming across this problem in the future, as I just did.
Edit: if you're querying points outside of the range over which the KDE was calculated, this method can of course only give you the same result as the KDTree query, namely the corresponding border of your KDE grid. You would, however, have to hardcode this by clipping the resulting x_ind to the valid index range, i.e. between 0 and len(axes)-1; otherwise a negative index silently wraps around to the other end of the array.
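The index lookup with clipping described above can be sketched as follows. This is only an illustration under the assumption of a uniform, ascending 1D grid; `lookup_on_grid` is a hypothetical helper, and the grid and density values are stand-ins for what fastKDE.pdf would return. For out-of-range queries it returns the border value of the grid, not a true extrapolated density.

```python
import numpy as np

def lookup_on_grid(pdf, axis, query):
    """Nearest-grid-point lookup of a 1D density, clipped to the grid borders."""
    axis = np.asarray(axis, dtype=float)
    # Spacing of a uniform grid with len(axis) points
    step = (axis[-1] - axis[0]) / (len(axis) - 1)
    query = np.asarray(query, dtype=float)
    index = np.round((query - axis[0]) / step).astype(int)
    # Queries outside the grid map to the first or last grid point
    index = np.clip(index, 0, len(axis) - 1)
    return np.asarray(pdf)[index]

# Stand-in grid and density; in practice these would come from fastKDE.pdf
axis = np.linspace(-4.0, 4.0, 9)
pdf = np.exp(-0.5 * axis**2) / np.sqrt(2.0 * np.pi)
values = lookup_on_grid(pdf, axis, [-10.0, 0.2, 10.0])
```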

Python: Choose the n points better distributed from a bunch of points

I have a numpy array of points in an XY plane like:
I want to select the n points (let's say 100) better distributed from all these points. This is, I want the density of points to be constant anywhere.
Something like this:
Is there any pythonic way or any numpy/scipy function to do this?
@EMS is very correct that you should give a lot of thought to exactly what you want.
There are more sophisticated ways to do this (EMS's suggestions are very good!), but a brute-force-ish approach is to bin the points onto a regular, rectangular grid and draw a random point from each bin.
The major downside is that you won't get the number of points you ask for. Instead, you'll get some number smaller than that number.
A bit of creative indexing with pandas makes this "gridding" approach quite easy, though you can certainly do it with "pure" numpy, as well.
As an example of the simplest possible, brute force, grid approach: (There's a lot we could do better, here.)
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
total_num = 100000
x, y = np.random.normal(0, 1, (2, total_num))
# We'll always get fewer than this number for two reasons.
# 1) We're choosing a square grid, and "subset_num" may not be a perfect square
# 2) There won't be data in every cell of the grid
subset_num = 1000
# Bin points onto a rectangular grid with approximately "subset_num" cells
nbins = int(np.sqrt(subset_num))
xbins = np.linspace(x.min(), x.max(), nbins+1)
ybins = np.linspace(y.min(), y.max(), nbins+1)
# Make a dataframe indexed by the grid coordinates.
i, j = np.digitize(y, ybins), np.digitize(x, xbins)
df = pd.DataFrame(dict(x=x, y=y), index=[i, j])
# Group by which cell the points fall into and choose a random point from each
groups = df.groupby(df.index)
new = groups.agg(lambda x: np.random.permutation(x)[0])
# Plot the results
fig, axes = plt.subplots(ncols=2, sharex=True, sharey=True)
axes[0].plot(x, y, 'k.')
axes[0].set_title('Original $(n={})$'.format(total_num))
axes[1].plot(new.x, new.y, 'k.')
axes[1].set_title('Subset $(n={})$'.format(len(new)))
plt.setp(axes, aspect=1, adjustable='box')  # 'box-forced' was removed in newer matplotlib
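The same gridding idea can also be written in pure NumPy, as mentioned above. The following is a sketch rather than a drop-in replacement: `grid_subsample` is a hypothetical helper, and unlike the pandas version it clips the maximum values into the last cell so that there are exactly nbins cells per axis.

```python
import numpy as np

def grid_subsample(x, y, nbins, seed=0):
    """Pick one random point from every occupied cell of an nbins x nbins grid."""
    rng = np.random.default_rng(seed)
    xbins = np.linspace(x.min(), x.max(), nbins + 1)
    ybins = np.linspace(y.min(), y.max(), nbins + 1)
    # Cell indices in 0 .. nbins-1; clip puts the maximum into the last cell
    ix = np.clip(np.digitize(x, xbins) - 1, 0, nbins - 1)
    iy = np.clip(np.digitize(y, ybins) - 1, 0, nbins - 1)
    cell = iy * nbins + ix
    # Shuffle first so the first point found in each cell is a random one
    order = rng.permutation(len(x))
    _, first = np.unique(cell[order], return_index=True)
    chosen = order[first]
    return x[chosen], y[chosen]

x, y = np.random.default_rng(1).normal(0, 1, (2, 100000))
sub_x, sub_y = grid_subsample(x, y, nbins=31)
```

As with the pandas version, the number of points returned equals the number of occupied cells, which is at most nbins squared.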
Loosely based on @EMS's suggestion in a comment, here's another approach.
We'll calculate the density of points using a kernel density estimate, and then use the inverse of that as the probability that a given point will be chosen.
scipy.stats.gaussian_kde is not optimized for this use case (or for large numbers of points in general). It's the bottleneck here. It's possible to write a more optimized version for this specific use case in several ways (approximations, special case here of pairwise distances, etc). However, that's beyond the scope of this particular question. Just be aware that for this specific example with 1e5 points, it will take a minute or two to run.
The advantage of this method is that you get the exact number of points that you asked for. The disadvantage is that you are likely to have local clusters of selected points.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde
total_num = 100000
subset_num = 1000
x, y = np.random.normal(0, 1, (2, total_num))
# Let's approximate the PDF of the point distribution with a kernel density
# estimate. scipy.stats.gaussian_kde is slow for large numbers of points, so
# you might want to use another implementation in some cases.
xy = np.vstack([x, y])
dens = gaussian_kde(xy)(xy)
# Try playing around with this weight. Compare 1/dens, 1-dens, and (1-dens)**2
weight = 1 / dens
weight /= weight.sum()
# Draw a sample using np.random.choice with the specified probabilities.
# We'll need to view things as a structured array because np.random.choice
# expects a 1D array.
dat = xy.T.ravel().view([('x', float), ('y', float)])
subset = np.random.choice(dat, subset_num, p=weight)
# Plot the results
fig, axes = plt.subplots(ncols=2, sharex=True, sharey=True)
axes[0].scatter(x, y, c=dens, edgecolor='none')
axes[0].set_title('Original $(n={})$'.format(total_num))
axes[1].plot(subset['x'], subset['y'], 'k.')
axes[1].set_title('Subset $(n={})$'.format(len(subset)))
plt.setp(axes, aspect=1, adjustable='box')  # 'box-forced' was removed in newer matplotlib
Unless you give a specific criterion for defining "better distributed" we can't give a definite answer.
The phrase "constant density of points anywhere" is also misleading, because you have to specify the empirical method for calculating density. Are you approximating it on a grid? If so, the grid size will matter, and points near the boundary won't be correctly represented.
A different approach might be as follows:
1. Calculate the distance matrix between all pairs of points.
2. Treating this distance matrix as a weighted network, calculate some measure of centrality for each point in the data, such as eigenvector centrality, betweenness centrality or Bonacich centrality.
3. Order the points in descending order according to the centrality measure, and keep the first 100.
4. Repeat steps 1-3, possibly using a different notion of "distance" between points and different centrality measures.
Many of these functions are provided directly by SciPy, NetworkX, and scikit-learn and will work directly on a NumPy array.
If you are definitely committed to thinking of the problem in terms of regular spacing and grid density, you might take a look at quasi-Monte Carlo methods. In particular, you could try to compute the convex hull of the set of points and then apply a QMC technique to regularly sample from anywhere within that convex hull. But again, this privileges the exterior of the region, which should be sampled far less than the interior.
Yet another interesting approach would be to simply run the K-means algorithm on the scattered data, with a fixed number of clusters K=100. After the algorithm converges, you'll have 100 points from your space (the mean of each cluster). You could repeat this several times with different random starting points for the cluster means and then sample from that larger set of possible means. Since your data do not appear to actually cluster into 100 components naturally, the convergence of this approach won't be very good and may require running the algorithm for a large number of iterations. This also has the downside that the resulting 100 points are not necessarily points that come from the observed data, and instead will be local averages of many points.
This method iteratively picks, from the remaining points, the point with the largest minimum distance to the already picked points. It has terrible time complexity, but produces pretty uniformly distributed results:
from numpy import array, argmax, ndarray, vstack
from numpy.random import normal, randint
from scipy.spatial.distance import cdist

def well_spaced_points(points: ndarray, num_points: int):
    """
    Pick `num_points` well-spaced points from `points` array.

    :param points: An m x n array of m n-dimensional points.
    :param num_points: The number of points to pick.
    :rtype: ndarray
    :return: A num_points x n array of points from the original array.
    """
    # pick a random starting point from the whole array
    current_point_index = randint(0, points.shape[0])
    picked_points = array([points[current_point_index]])
    remaining_points = vstack((
        points[: current_point_index],
        points[current_point_index + 1:]
    ))
    # while there are more points to pick
    while picked_points.shape[0] < num_points:
        # find the remaining point furthest from all picked points
        distance_pk_rmn = cdist(picked_points, remaining_points)
        min_distance_pk = distance_pk_rmn.min(axis=0)
        i_furthest = argmax(min_distance_pk)
        # add it to picked points and remove it from remaining
        picked_points = vstack((
            picked_points,
            remaining_points[i_furthest]
        ))
        remaining_points = vstack((
            remaining_points[: i_furthest],
            remaining_points[i_furthest + 1:]
        ))
    return picked_points

