Bin average as a function of position - python

I want to efficiently calculate the average of a variable (say temperature) over multiple areas of the plane.
I essentially want to do the following.
import numpy as np
num = 10000
XYT = np.random.uniform(0, 1, (num, 3))
X = np.transpose(XYT)[0]
Y = np.transpose(XYT)[1]
T = np.transpose(XYT)[2]
size = 10
bins = np.empty((size, size))
for i in range(size):
    for j in range(size):
        # select the points whose rescaled (X, Y) fall into bin (i, j) ...
        in_bin = ((X * size).astype(int) == i) & ((Y * size).astype(int) == j)
        # ... and store their mean temperature
        bins[i][j] = T[in_bin].mean()

I would use pandas (although I'm sure you can achieve basically the same with vanilla numpy):
import pandas as pd
df = pd.DataFrame({'x': X, 'y': Y, 'temp': T})
# solve quadrants
df['quadrant'] = (df['x'] >= 0) * 2 + (df['y'] >= 0) * 1
# group by and aggregate
mean_per_quadrant = df.groupby(['quadrant'])['temp'].aggregate(['mean'])
You may need to create multiple quadrant cutoffs to get unique groupings.
For example, (df['x'] >= 50) * 4 + (df['x'] >= 0) * 2 + (df['y'] >= 0) * 1 would add an extra two quadrants to our grouping (one with y >= 0 and one with y < 0); just make sure you use powers of 2.
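As a vanilla-numpy sketch of the binned mean itself (my addition, reusing X, Y, T, and size from the question): np.histogram2d can build the 10x10 grid directly, since dividing a T-weighted histogram by an unweighted one gives the per-bin mean.
import numpy as np
# sum of T per bin and the number of samples per bin (both 10x10)
sums, xedges, yedges = np.histogram2d(X, Y, bins=size, weights=T)
counts, _, _ = np.histogram2d(X, Y, bins=size)
bins = sums / counts  # per-bin mean temperature; NaN where a bin is empty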

Is it possible/ good practice to plot a pandas column containing tuples representing coordinates (x, y)?

I am building a simple simulation of a surf ride where, at any given second, the surfer has coordinates (x, y) on a two-dimensional plane.
I defined a function generating a list of tuple coordinates for a 100-second ride:
import random

def takeof(vague, surfer):
    positions = [(0, 0)]
    # track the wave's y position and the surfer's x position
    position_v_y = 0
    position_s_x = 0
    for i in range(1, 100):
        position_v_y = position_v_y + vague.vitesse + random.choice([0.5, 0, -0.5])
        position_s_y = position_v_y
        position_s_x = position_s_x + surfer.vitesse + random.choice([0.25, 0, -0.25])
        moveto = (position_s_x, position_s_y)
        positions.append(moveto)
    return positions
Then, in order to plot 10 simulated waves on the same graph, I defined another function, assuming my data would be easier to work with inside a pandas DataFrame:
import pandas as pd

def sim(vague, surfer):
    dflist = pd.DataFrame()
    for n in range(1, 11):
        coordonnees = takeof(vague, surfer)
        dflist[str(n)] = coordonnees
        print(coordonnees)
        print(dflist[str(n)])
    print(dflist)
The DataFrame gets created fine, with each column being a series of tuple coordinates; however, I struggle in my attempts to plot all 10 waves in a single graph. Is there an obvious way I can't see? Is it generally a bad idea to proceed this way when simulating coordinates?
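For what it's worth, one way to plot the tuple columns (a sketch of my own, not from the original thread; dflist is the DataFrame built by sim above) is to unzip each column into x and y sequences first:
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
for col in dflist.columns:
    xs, ys = zip(*dflist[col])  # split the (x, y) tuples into two sequences
    ax.plot(xs, ys, label=col)
ax.legend(title='wave')
plt.show()
That said, storing x and y in separate columns (or one long-format frame with wave, x, and y columns) usually plays better with pandas' own plotting and avoids the per-column unpacking.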

Having trouble getting timeit to run with numpy

I simply want to see how long it takes this code to execute. There is a similar question here:
timeit module in python does not recognize numpy module
and I understand what they are saying, but I don't get where those lines of code should be placed. Here is what I have. I know it's a little long to scroll through, but you can see where I have placed the timeit commands at the beginning and end. This is not working, and I am guessing it is because I have placed the timeit lines incorrectly. The code works if I delete the timeit stuff.
Thanks
import timeit
import numpy as np
u = timeit.Timer("np.arange(1000)", setup='import numpy as np')
#set up variables
m = 4.54
g = 9.81
GR = 8
r_pulley = .1
th1=np.pi/4 #based on motor 1 encoder counts. Number of degrees rotated from + x-axis of base frame 0
th2=np.pi/4 #based on motor 2 encoder counts. Number of degrees rotated from + x-axis of m1 frame 1
th3_motor = np.pi/4*12
th3_pulley = th3_motor/GR
#required forces in x,y,z at end effector
fx = 1
fy = 1
fz = m*g #need to figure this out
l1=6
l2=5
l3=th3_pulley*r_pulley
#Build Homogeneous Transform Matrices
H1_0 = np.array(([np.cos(th1),-np.sin(th1),0,0],[np.sin(th1),np.cos(th1),0,0],[0,0,1,l3],[0,0,0,1]))
H2_1 = np.array(([np.cos(th2),-np.sin(th2),0,l1],[np.sin(th2),np.cos(th2),0,0],[0,0,1,0],[0,0,0,1]))
H3_2 = np.array(([1,0,0,l2],[0,1,0,0],[0,0,1,0],[0,0,0,1]))
H2_0 = np.dot(H1_0,H2_1)
H3_0 = np.dot(H2_0,H3_2)
print(np.matrix(H3_0))
#These HTMs are using the way I derived them, not the "correct" way.
#The answers are the same, and I think the processing time will be the same too.
#This is because either way the two matrices with all the sines and cosines
#will be the same. The only difference is that in one method the ones-and-zeroes
#matrix is the first HTM, and in the other it is the last HTM. So it's the
#same number of matrices with the same information, just being dot-producted
#in a different order.
#Build Jacobian
#np.cross(x, y)
d10 = H1_0[0:3, 3]
d20 = H2_0[0:3, 3]
d30 = H3_0[0:3, 3]
print(d30)
subt1 = d30-d10
subt2 = d30-d20
#tsubt1 = subt1.transpose()
#tsubt2 = subt2.transpose()
#print(tsubt1)
zeroes = np.array(([0,0,1]))
print(subt1)
print(subt2)
cross1 = np.cross(zeroes, subt1)
cross2 = np.cross(zeroes, subt2)
cross1
cross2
#These cross products are correct but need to be transposed into columns; right now they are a single row.
#tcross1=cross1.reshape(-1,1)
#tcross2=cross2.reshape(-1,1)
#don't actually need these transposes, but I didn't want to forget the command.
# build jacobian (J)
#J = np.zeros((6,2))
#J[0:3,0] = cross1
#J[0:3,1] = cross2
#J[3:6,0] = zeroes
#J[3:6,1] = zeroes
#J
#find torques
J_force = np.zeros((2,3))
J_force[0,:]=cross1
J_force[1,:]=cross2
J_force
#build force matrix
forces = np.array(([fx],[fy],[fz]))
forces
torques = np.dot(J_force,forces)
torques #top number is theta 1 (M1) and bottom number is theta 2 (M2)
#need to add z axis?
print(u.timeit())
The Timer you built only measures np.arange(1000), not the rest of your script:
# u is a timer that evaluates np.arange(1000)
u = timeit.Timer("np.arange(1000)", setup='import numpy as np')
# print how many seconds are needed to run np.arange(1000) 1,000,000 times
# (1,000,000 is the default; you can change it by passing an int to timeit())
print(u.timeit())
So the following is what you want:
import timeit
import numpy as np

def main():
    #set up variables
    m = 4.54
    g = 9.81
    GR = 8
    r_pulley = .1
    th1 = np.pi/4 #based on motor 1 encoder counts. Number of degrees rotated from + x-axis of base frame 0
    th2 = np.pi/4 #based on motor 2 encoder counts. Number of degrees rotated from + x-axis of m1 frame 1
    th3_motor = np.pi/4*12
    th3_pulley = th3_motor/GR
    #required forces in x,y,z at end effector
    fx = 1
    fy = 1
    fz = m*g #need to figure this out
    l1 = 6
    l2 = 5
    l3 = th3_pulley*r_pulley
    #Build Homogeneous Transform Matrices
    H1_0 = np.array(([np.cos(th1),-np.sin(th1),0,0],[np.sin(th1),np.cos(th1),0,0],[0,0,1,l3],[0,0,0,1]))
    H2_1 = np.array(([np.cos(th2),-np.sin(th2),0,l1],[np.sin(th2),np.cos(th2),0,0],[0,0,1,0],[0,0,0,1]))
    H3_2 = np.array(([1,0,0,l2],[0,1,0,0],[0,0,1,0],[0,0,0,1]))
    H2_0 = np.dot(H1_0,H2_1)
    H3_0 = np.dot(H2_0,H3_2)
    print(np.matrix(H3_0))
    #These HTMs are using the way I derived them, not the "correct" way.
    #The answers are the same, and I think the processing time will be the same too.
    #This is because either way the two matrices with all the sines and cosines
    #will be the same. The only difference is that in one method the ones-and-zeroes
    #matrix is the first HTM, and in the other it is the last HTM. So it's the
    #same number of matrices with the same information, just being dot-producted
    #in a different order.
    #Build Jacobian
    #np.cross(x, y)
    d10 = H1_0[0:3, 3]
    d20 = H2_0[0:3, 3]
    d30 = H3_0[0:3, 3]
    print(d30)
    subt1 = d30 - d10
    subt2 = d30 - d20
    #tsubt1 = subt1.transpose()
    #tsubt2 = subt2.transpose()
    #print(tsubt1)
    zeroes = np.array(([0,0,1]))
    print(subt1)
    print(subt2)
    cross1 = np.cross(zeroes, subt1)
    cross2 = np.cross(zeroes, subt2)
    #These cross products are correct but need to be transposed into columns; right now they are a single row.
    #tcross1 = cross1.reshape(-1,1)
    #tcross2 = cross2.reshape(-1,1)
    #don't actually need these transposes, but I didn't want to forget the command.
    # build jacobian (J)
    #J = np.zeros((6,2))
    #J[0:3,0] = cross1
    #J[0:3,1] = cross2
    #J[3:6,0] = zeroes
    #J[3:6,1] = zeroes
    #find torques
    J_force = np.zeros((2,3))
    J_force[0,:] = cross1
    J_force[1,:] = cross2
    #build force matrix
    forces = np.array(([fx],[fy],[fz]))
    torques = np.dot(J_force, forces)
    torques #top number is theta 1 (M1) and bottom number is theta 2 (M2)
    #need to add z axis?

u = timeit.Timer(main)
print(u.timeit(5))
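Equivalently (a small variation I'm adding, not part of the original answer), the module-level helper accepts a callable directly, so you can skip constructing a Timer object:
import timeit
# run main() 5 times and report the total elapsed seconds
print(timeit.timeit(main, number=5))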

Moving window averages on unequal dimensions

TL;DR: Is there any way I can get rid of my second for-loop?
I have a time series of points on a 2D grid. To get rid of fast fluctuations in their positions, I average the coordinates over a window of frames. Now in my case, it's possible for the points to cover a larger distance than usual. I don't want to include frames for a specific point if it travels farther than the cut_off value.
In the first for-loop, I go over all frames and define the moving window. I then calculate the distances between the current frame and each frame in the moving window. Afterwards, I grab only those positions, from all frames, where both the x and y components did not travel farther than cut_off. Now I want to calculate the mean positions for every point from all these selected frames of the moving window (note: the number of selected frames can be smaller than n_window). This leads me to the second for-loop. Here I iterate over all points and grab the positions from the frames in which the current point did not travel farther than cut_off. From these selected frames, I calculate the mean value of the coordinates and use it as the new value for the current frame.
This very last for-loop slows down the whole processing, and I can't come up with a better way to accomplish this calculation. Any suggestions?
MWE
Put in comments for clarification.
import numpy as np
# Generate a timeseries with 1000 frames, each
# containing 50 individual points defined by their
# x and y coordinates
n_frames = 1000
n_points = 50
n_coordinates = 2
timeseries = np.random.randint(-100, 100, [n_frames, n_points, n_coordinates])
# Set window size to 20 frames
n_window = 20
# Distance cut off
cut_off = 60
# Set up empty array to hold results
avg_data_store = np.zeros([n_frames, timeseries.shape[1], 2])
# Iterate over all frames
for frame in np.arange(0, n_frames):
    # Set the frame according to the window size that we're looking at
    t_before = int(frame - (n_window / 2))
    t_after = int(frame + (n_window / 2))
    # If we're trying to access frames below 0, set the lowest one to 0
    if t_before < 0:
        t_before = 0
    # Trying to access frames that are not in the trajectory, set to last frame
    if t_after > n_frames - 1:
        t_after = n_frames - 1
    # Grab x and y coordinates for all points in the corresponding window
    pos_before = timeseries[t_before:frame]
    pos_after = timeseries[frame + 1:t_after + 1]
    pos_now = timeseries[frame]
    # Calculate the distance between the current frame and the windows before/after
    d_before = np.abs(pos_before - pos_now)
    d_after = np.abs(pos_after - pos_now)
    # Grab indices of frames+points that are below the cut off
    arg_before = np.argwhere(np.all(d_before < cut_off, axis=2))
    arg_after = np.argwhere(np.all(d_after < cut_off, axis=2))
    # Iterate over all points
    for i in range(0, timeseries.shape[1]):
        # Create temp array
        temp_stack = pos_now[i]
        # Grab all frames in which the current point did _not_
        # travel farther than `cut_off`
        all_before = arg_before[arg_before[:, 1] == i][:, 0]
        all_after = arg_after[arg_after[:, 1] == i][:, 0]
        # Grab the corresponding positions for this point in these frames
        all_pos_before = pos_before[all_before, i]
        all_pos_after = pos_after[all_after, i]
        # If we have any frames for that point before / after,
        # stack them into the temp array
        if all_pos_before.size > 0:
            temp_stack = np.vstack([all_pos_before, temp_stack])
        if all_pos_after.size > 0:
            temp_stack = np.vstack([temp_stack, all_pos_after])
        # Calculate the moving window average for the selection of frames
        avg_data_store[frame, i] = temp_stack.mean(axis=0)
If you are fine with calculating the cutoff distance in x and y separately, you can use scipy.ndimage.generic_filter.
import numpy as np
from scipy.ndimage import generic_filter

def _mean(x, cutoff):
    # exclude window entries that are too far from the centre value
    is_too_different = np.abs(x - x[len(x) // 2]) > cutoff
    return np.mean(x[~is_too_different])

def _smooth(x, window_length=5, cutoff=1.):
    return generic_filter(x, _mean, size=window_length, mode='nearest', extra_keywords=dict(cutoff=cutoff))

def smooth(arr, window_length=5, cutoff=1., axis=-1):
    return np.apply_along_axis(_smooth, axis, arr, window_length=window_length, cutoff=cutoff)

# --------------------------------------------------------------------------------

def _simulate_movement_2d(T, fraction_is_jump=0.01):
    # generate random velocities with a few "jumps"
    velocity = np.random.randn(T, 2)
    is_jump = np.random.rand(T) < fraction_is_jump
    jump = 10 * np.random.randn(T, 2)
    jump[~is_jump] = 0.
    # pre-allocate position and momentum arrays
    position = np.zeros((T, 2))
    momentum = np.zeros((T, 2))
    # initialise the first position
    position[0] = np.random.randn(2)
    # update position using velocity vector:
    # smooth movement by not applying the velocity directly
    # but rather by keeping track of the momentum
    for ii in range(2, T):
        momentum[ii] = 0.9 * momentum[ii-1] + 0.1 * velocity[ii-1]
        position[ii] = position[ii-1] + momentum[ii] + jump[ii]
    # add some measurement noise
    noise = np.random.randn(T, 2)
    position += noise
    return position

def demo(nframes=1000, npoints=3):
    # create data with shape (npoints, nframes, 2)
    positions = np.array([_simulate_movement_2d(nframes) for ii in range(npoints)])
    # smooth along the frame axis
    smoothed = smooth(positions, window_length=11, cutoff=5., axis=1)
    # plot
    x, y = positions.T
    xs, ys = smoothed.T
    import matplotlib.pyplot as plt
    fig, ax = plt.subplots(1, 1)
    ax.plot(x, y, 'o')
    ax.plot(xs, ys, 'k-', alpha=0.3, lw=2)
    plt.show()

demo()
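To run this on the array from the question (my sketch, using the question's names): frames are axis 0 of timeseries, and the integer data should be cast to float so the window means are not truncated.
# timeseries has shape (n_frames, n_points, 2); smooth along the frame axis
smoothed = smooth(timeseries.astype(float), window_length=n_window, cutoff=cut_off, axis=0)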

What is the most efficient method for accessing and manipulating a pandas df

I am working on an agent-based modelling project and have an 800x800 grid that represents a landscape. Each cell in this grid is assigned certain variables. One of these variables is 'vegetation' (i.e. which functional_types this cell possesses). I have a data frame that looks as follows:
Each cell is assigned a landscape_type before I access this data frame. I then loop through each cell in the 800x800 grid and assign more variables. For example, if cell 1 is landscape_type 4, I need to access the above data frame, generate a random number for each functional_type between min and max_species_percent, and then assign all the variables (i.e. pollen_loading, succession_time, etc.) for that landscape_type to that cell. However, if the cumsum of the random numbers is <100, I grab functional_types from the next landscape_type (so in this example, I would move down to landscape_type 3); this continues until I reach a cumsum closer to 100.
I have this process working as desired; however, it is incredibly slow - as you can imagine, there are hundreds of thousands of assignments! So far I do this (self.model.veg_data is the above df):
def create_vegetation(self, landscape_type):
    if landscape_type == 4:
        veg_this_patch = self.model.veg_data[self.model.veg_data['landscape_type'] <= landscape_type].copy()
    else:
        veg_this_patch = self.model.veg_data[self.model.veg_data['landscape_type'] >= landscape_type].copy()
    veg_this_patch['veg_total'] = veg_this_patch.apply(lambda x: randint(x["min_species_percent"],
                                                                         x["max_species_percent"]), axis=1)
    veg_this_patch['cum_sum_veg'] = veg_this_patch.veg_total.cumsum()
    veg_this_patch = veg_this_patch[veg_this_patch['cum_sum_veg'] <= 100]
    self.vegetation = veg_this_patch
I am certain there is a more efficient way to do this. The process will be repeated constantly, and as the model progresses, landscape_types will change (i.e. 3 becomes 4), so it's essential this becomes as fast as possible! Thank you.
EDIT (as per the comment):
The loop that creates the landscape objects is given below:
for agent, x, y in self.grid.coord_iter():
    # check that patch is land
    if self.landscape.elevation[x, y] != -9999.0:
        elevation_xy = int(self.landscape.elevation[x, y])
        # calculate burn probabilities based on soil and temp
        burn_s_m_p = round(2 - (1 / (1 + (math.exp(-(self.landscape.soil_moisture[x, y] * 3)))) * 2), 4)
        burn_s_t_p = round(1 / (1 + (math.exp(-(self.landscape.soil_temp[x, y] * 1))) * 3), 4)
        # calculate succession probabilities based on soil and temp
        succ_s_m_p = round(2 - (1 / (1 + (math.exp(-(self.landscape.soil_moisture[x, y] * 0.5)))) * 2), 4)
        succ_s_t_p = round(1 / (1 + (math.exp(-(self.landscape.soil_temp[x, y] * 1))) * 0.5), 4)
        vegetation_typ_xy = self.landscape.vegetation[x, y]
        time_colonised_xy = self.landscape.time_colonised[x, y]
        is_patch_colonised_xy = self.landscape.colonised[x, y]
        # populate landscape patch with values
        patch = Landscape((x, y), self, elevation_xy, burn_s_m_p, burn_s_t_p, vegetation_typ_xy,
                          False, time_colonised_xy, is_patch_colonised_xy, succ_s_m_p, succ_s_t_p)
        self.grid.place_agent(patch, (x, y))
        self.schedule.add(patch)
Then, in the object itself, I call the create_vegetation function to add the functional_types from the above df. Everything else in this loop comes from a different dataset, so it isn't relevant.
You need to extract as many calculations as you can into a vectorized preprocessing step. For example, in your 800x800 loop you have:
burn_s_m_p = round(2-(1/(1 + (math.exp(- (self.landscape.soil_moisture[x, y] * 3)))) * 2),4)
Instead of executing this line 800x800 times, just do it once, during initialization:
burn_array = np.round(2-(1/(1 + (np.exp(- (self.landscape.soil_moisture * 3)))) * 2),4)
Now in your loop it is simply:
burn_s_m_p = burn_array[x, y]
Apply this technique to the rest of the similar lines.
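A sketch of what that preprocessing could look like for all four per-cell formulas (my illustration; the helper name and bare-array arguments are hypothetical, standing in for the self.landscape attributes):
import numpy as np
def precompute_probabilities(soil_moisture, soil_temp):
    # vectorized versions of the four round(...) formulas, computed once
    # over the whole 800x800 arrays instead of per cell
    burn_s_m = np.round(2 - (1 / (1 + np.exp(-soil_moisture * 3))) * 2, 4)
    burn_s_t = np.round(1 / (1 + np.exp(-soil_temp) * 3), 4)
    succ_s_m = np.round(2 - (1 / (1 + np.exp(-soil_moisture * 0.5))) * 2, 4)
    succ_s_t = np.round(1 / (1 + np.exp(-soil_temp) * 0.5), 4)
    return burn_s_m, burn_s_t, succ_s_m, succ_s_t
Inside the loop, each value then becomes a plain lookup, e.g. succ_s_m[x, y].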

Pandas dataframe column cut - add more bins more frequently around the mean

I am categorizing a quantitative variable (e.g. price), and I would like to categorize it in such a manner that the bins are much more frequent around the mean and less frequent away from it.
I have seen that it is possible to cut() in a linear manner and, thanks to numpy.logspace, in a logarithmic manner, but binning around the mean seems to be missing, and my ideas so far haven't worked and seem inefficient.
You can make bins whose widths increase linearly with distance from the mean:
import numpy as np
def make_progressive_bins(min_x, max_x, mean_x, num_bins=10):
    # half-range: the largest distance from the mean to either end of the data
    x_rel_lim = max(mean_x - min_x, max_x - mean_x)
    num_bins_half = num_bins // 2
    bins_right = np.arange(0, num_bins_half + 1)
    if num_bins % 2 == 1:
        bins_right = bins_right + 0.5
    bins_right = np.cumsum(bins_right)
    bins = np.concatenate([-bins_right[bins_right > 0][::-1], bins_right])
    bins = bins * (float(x_rel_lim) / bins[-1]) + mean_x
    return bins
And then you can use it like:
import numpy as np
import matplotlib.pyplot as plt
bins = make_progressive_bins(-20, 50, 10, 15)
plt.bar(bins - 0.1, np.ones_like(bins), 0.2)
plt.show()
I made a script that might do what you want to achieve, but I'm not sure how to convert the resulting cut object into a histogram to see if it does what I want it to do, so please check and tell me if it works :).
import numpy as np
import pandas as pd

# Make normally distributed price with mean 50.
df = pd.DataFrame(data=np.random.normal(50, size=1000), columns=['price'])
df.hist(bins=30)
num_bins = 100
# I used a square function to place the bins more densely around 0 and
# less densely at the outskirts of the range.
shape_func = lambda x: x**2
bin_loc = [shape_func(i) for i in range(num_bins // 2)]
mirrored_bin_loc = [-x for x in bin_loc[::-1]]
bin_loc = mirrored_bin_loc + bin_loc[1:]
# Rescale to the data range and translate so the bins are centred on the mean
# (values at the extremes may fall just outside the edge bins and become NaN)
data_mean = df.price.mean()
data_range = df.price.max() - df.price.min()
final_bin_loc = [x / max(bin_loc) * (data_range / 2) + data_mean for x in bin_loc]
# display(final_bin_loc)
binned = pd.cut(df.price, final_bin_loc)
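To do that check (a sketch I'm adding, not part of the original answer), count the prices per interval and plot the counts:
import matplotlib.pyplot as plt
counts = binned.value_counts().sort_index()  # one count per interval, in bin order
counts.plot(kind='bar')
plt.show()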
