"Sensibly" remove points in a Python list - python

Suppose I have two arrays indicating the x and y coordinates of a calibration curve.
X = [1,2,3,4,5,6,7,8,9,10,12,14,16,18,20,30,40,50]
Y = [2,4,6,8,10,12,14,16,18,20,24,28,32,36,40,60,80,100]
My example arrays above contain 18 points. You'll notice that the x values are not linearly spaced; there are more points at lower values of x.
Let's suppose I need to reduce the number of points in my calibration curve to 13 points. Obviously, I could just remove the first five or the last five points, but that would shorten my overall range of x values. To maintain range and minimise the space between x values I would preferentially remove values x= 2,4,6,8,10. Removing these x points and their respective y values would leave 13 points in the curve as required.
How could I do this point selection and removal automatically in Python? I.e. Is there an algorithm to pick the best x points from a list, where "best" is defined as keeping the points as close as possible while keeping the overall range and adhering to the new number of points.
Please note that the points remaining must be in the original lists, so I can't interpolate the 18 points on to a 13 point grid.

This maximizes the sum of squared distances between consecutive chosen points, which in some sense spreads the points as far apart as possible.
import itertools
list(max(itertools.combinations(sorted(X), 13),
         key=lambda l: sum((a - b) ** 2 for a, b in zip(l, l[1:]))))
Note that this is only feasible for small problems. The time complexity for selecting k points is O(k * (len(X) choose k)), so basically O(exp(len(X))). So don't even think about using this for, e.g., len(X) == 100 and k == 10.
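Just to put a number on that warning, the count of candidate subsets alone for that last case is enormous:
import math
print(math.comb(100, 10))   # 17310309456440 subsets, each of which would need scoring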

X = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14, 16, 18, 20, 30, 40, 50]
Y = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 24, 28, 32, 36, 40, 60, 80, 100]
assert len(X) == len(set(X)), "Duplicate X values found"
points = list(zip(X, Y))
points.sort() # sorts by X
while len(points) > 13:
    # Find the index whose neighbouring X values are closest together
    i = min(range(1, len(points) - 1), key=lambda p: points[p + 1][0] - points[p - 1][0])
    points.pop(i)
print(points)
Output:
[(1, 2), (3, 6), (5, 10), (7, 14), (10, 20), (12, 24), (14, 28), (16, 32), (18, 36), (20, 40), (30, 60), (40, 80), (50, 100)]
If you want the original series again:
X, Y = zip(*points)

An algorithm that would achieve that:
1. Convert each number into the sum of the absolute differences to the numbers to its left and right. If a neighbour is missing (first or last case), use MAX_INT. For example, 1 would become MAX_INT; 2 would become 2; 10 would become 3.
2. Remove the first case with the lowest sum.
3. If you need to remove more numbers, go to 1.
This would remove 2,4,6,8,10,3,...
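A minimal sketch of that procedure (the function name is just illustrative; it assumes X is sorted and uses sys.maxsize to stand in for MAX_INT). Note that with "first case with the lowest sum" as the tie-break, the fifth removal happens to be 9 rather than 10, which has the same net effect on the spacing:
import sys

def reduce_points(xs, k):
    # Repeatedly drop the first point with the lowest neighbour-difference sum
    # until only k points remain.
    xs = list(xs)
    while len(xs) > k:
        scores = [
            (abs(x - xs[i-1]) if i > 0 else sys.maxsize)
            + (abs(xs[i+1] - x) if i < len(xs) - 1 else sys.maxsize)
            for i, x in enumerate(xs)
        ]
        xs.pop(scores.index(min(scores)))
    return xs

print(reduce_points(X, 13))
# [1, 3, 5, 7, 10, 12, 14, 16, 18, 20, 30, 40, 50]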

Here is a recursive approach that repeatedly removes the point which will be the least missed:
def mostRedundantPoint(x):
    # returns the index, i, in the range 0 < i < len(x) - 1
    # that minimizes x[i+1] - x[i-1]
    # assumes len(x) > 2 and that x
    # is sorted in ascending order
    gaps = [x[i+1] - x[i-1] for i in range(1, len(x) - 1)]
    i = gaps.index(min(gaps))
    return i + 1

def reduceList(x, k):
    if len(x) <= k:
        return x
    else:
        i = mostRedundantPoint(x)
        return reduceList(x[:i] + x[i+1:], k)
X = [1,2,3,4,5,6,7,8,9,10,12,14,16,18,20,30,40,50]
print(reduceList(X,13))
#prints [1, 3, 5, 7, 10, 12, 14, 16, 18, 20, 30, 40, 50]
This list essentially agrees with your intended output since 7 vs. 8 have the same net effect. It is reasonably quick in the sense that it is almost instantaneous in reducing sorted([random.randint(1,10**6) for i in range(1000)]) from 1000 elements to 100 elements. The fact that it is recursive implies that it will blow the stack if you try to remove many more points than that, but with what seems to be your intended problem size that shouldn't be an issue. If need be, you could of course replace the recursion by a loop.
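If you do want to avoid the recursion, a loop-based variant is a small change (a minimal sketch reusing mostRedundantPoint; the name reduceListIterative is just illustrative):
def reduceListIterative(x, k):
    # Same greedy removal as reduceList, but iterative, so it will not hit
    # the recursion limit for large removal counts.
    x = list(x)
    while len(x) > k:
        i = mostRedundantPoint(x)
        x = x[:i] + x[i+1:]
    return x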

Related

How to evaluate columns that contain lists in pandas?

Say I have a dataframe which describes the dimensions of hundreds of cardboard boxes:
df = [[17829292, (13, 14, 20)], [17739292, (20, 10, 15)], [17827792, (10, 10, 12)]]
df = pd.DataFrame(df, columns = ['Serial Number', 'Box Dimensions'])
Given that the 'Box Dimensions' column contains a list of numbers, is there a simple way to search through the dataframe for all boxes with dimensions less than or equal to certain values, <= (15, 15, 15)? Or do I need to separate out each box's (x, y, z) values into three separate columns to evaluate them this way?
I've tried df.loc[df['Box Dimensions'] <= (15, 15, 15)], but for some reason this only evaluates the first number in the list. The function will return boxes whose x values are <=15, but whose y and z dimensions exceed these parameters.
You can try:
m = pd.DataFrame(df['Box Dimensions'].tolist()).le(15).all(1)
# OR (if you need to check each dimension separately, the above can be adapted):
m = df['Box Dimensions'].map(lambda x: x[0] <= 15 and x[1] <= 15 and x[2] <= 15)
# Finally:
df[m]
# OR
df.loc[m]
Output of the above code:
Serial Number Box Dimensions
2 17827792 (10, 10, 12)
Alternatively, if you need to test the box dimensions independently:
df.loc[(df['Box Dimensions'].str[0] <= 15) & (df['Box Dimensions'].str[1] <= 15) & (df['Box Dimensions'].str[2] <= 15)]
Serial Number Box Dimensions
2 17827792 (10, 10, 12)
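If you would rather take the "three separate columns" route mentioned in the question, a possible sketch (assuming every box has exactly three dimensions; the column names are arbitrary) is:
dims = pd.DataFrame(df['Box Dimensions'].tolist(),
                    columns=['x', 'y', 'z'], index=df.index)
# Keep only rows where all three dimensions are <= 15
df[dims.le(15).all(axis=1)]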

Renumber/Relabel a Numpy array based on coordinates

I have a segmentation map (numpy.ndarray) that contains objects labeled with unique numbers. I want to combine objects across multiple slices by labeling them with the same number. Specifically, I want to renumber objects based on a DataFrame containing centroid positions and the desired label value.
First, I created some mock labels and a DataFrame:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "slice": [0, 0, 0, 0, 1, 1, 1, 2, 2, 2],
    "number": [1, 2, 3, 4, 1, 2, 3, 1, 2, 3],
    "x": [10, 20, 30, 40, 11, 21, 31, 12, 22, 32],
    "y": [10, 20, 30, 40, 11, 21, 31, 12, 22, 32]
})

def make_segmap(df):
    x, y = np.indices((50, 50))
    maps = []
    # Iterate over slices and coordinates
    for n_slice in df["slice"].unique():
        masks = []
        for row in df[df["slice"] == n_slice].iterrows():
            # Create circle
            mask_circle = (x - row[1]["x"])**2 + (y - row[1]["y"])**2 < 5**2
            # Random index number (here just a multiple)
            masks.append(mask_circle * row[1]["number"] * 3)
        maps.append(np.max(masks, axis=0))
    return np.stack(maps, axis=0)
segmap = make_segmap(df)
For renumbering, this is what I came up with so far:
new_maps = []
# Iterate over slices
for n_slice in df["slice"].unique():
    new_labels = []
    for row in df[df["slice"] == n_slice].iterrows():
        # Find current value at position
        original_label = segmap[n_slice, row[1]["y"], row[1]["x"]]
        # Replace all label occurrences with the desired label from the DataFrame
        replaced_label = np.where(segmap[n_slice] == original_label, row[1]["number"], 0)
        new_labels.append(replaced_label)
    new_maps.append(np.max(new_labels, axis=0))
new_segmap = np.stack(new_maps, axis=0)
This works reasonably well but doesn't scale to larger datasets. The real dataset has thousands of objects across hundreds of slices and this approach takes very long to run (an hour or so). Are there any suggestions on how to replace multiple values at once to improve performance?
Thanks in advance.
You can use groupby to replace the current quadratic search algorithm by a (quasi) linear search. Moreover, you can take advantage of Numpy's vectorization and broadcasting to remove the inner loop and make the computation faster.
Here is a faster implementation:
def make_segmap_fast(df):
    x, y = np.indices((50, 50))
    maps = []
    # Iterate over slices and coordinates
    for n_slice, subDf in df.groupby("slice"):
        subDf_x = subDf["x"].to_numpy()[:, None, None]
        subDf_y = subDf["y"].to_numpy()[:, None, None]
        subDf_number = subDf["number"].to_numpy()[:, None, None]
        # Create circles (one 2D mask per row, via broadcasting)
        mask_circle = (x - subDf_x)**2 + (y - subDf_y)**2 < 5**2
        # Random index number (here just a multiple)
        masks = mask_circle * subDf_number
        maps.append(np.max(masks, axis=0) * 3)
    return np.stack(maps, axis=0)
On my machine, this is 2 times faster on the very small example (much more on bigger dataframes).
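The renumbering step itself can be vectorized in the same spirit by building a per-slice lookup table instead of calling np.where once per object. A rough sketch (assuming every centroid lies inside its object and the labels are small non-negative integers):
new_segmap = np.zeros_like(segmap)
for n_slice, sub in df.groupby("slice"):
    sl = segmap[n_slice]
    # Labels currently found at the centroid positions
    old_labels = sl[sub["y"].to_numpy(), sub["x"].to_numpy()]
    # Lookup table mapping old label -> desired number (everything else -> 0)
    lut = np.zeros(int(sl.max()) + 1, dtype=sl.dtype)
    lut[old_labels] = sub["number"].to_numpy()
    new_segmap[n_slice] = lut[sl]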

how to select some lines created by points based on their distances in python

I have some lines created by connecting points of a regular grid and want to pair the correct lines to create surfaces. These are the coordinates of my point array:
coord=np.array([[0.,0.,2.], [0.,1.,3.], [0.,2.,2.], [1.,0.,1.], [1.,1.,3.],\
[1.,2.,1.], [2.,0.,1.], [2.,1.,1.], [3.,0.,1.], [4.,0.,1.]])
Then, I created lines by connecting points. My points are from a regular grid. So, I have two perpendicular sets of lines. I called them blue (vertical) and red (horizontal) lines. To do so:
blue_line = []
for ind, i in enumerate(range(len(coord) - 1)):
    if coord[i][0] == coord[i+1][0]:
        line = [ind, ind+1]
        # line = [x+1 for x in line]
        blue_line.append(line)
threshold_x = 1.5
threshold_y = 1.5
i, j = np.where((coord[:, 1] == coord[:, np.newaxis, 1]) &
                (abs(coord[:, 0] - coord[:, np.newaxis, 0]) < 1.2 * threshold_y))
# Restrict to where i is before j
i, j = i[i < j], j[i < j]
# Combine and print the indices
red_line = np.vstack([i, j]).T
blue_line = np.array(blue_line)
red_line = np.array(red_line)
all_line = np.concatenate((blue_line, red_line), axis=0)
To find the correct lines for creating surfaces, I compare the center of each line with the centers of the adjacent ones. I start from the first blue line and check whether there are three other adjacent lines. If I find a line whose center is closer than threshold_x and whose x coordinate differs from that of the current line, I keep it as a pair. Then I continue searching for adjacent lines with this rule. My figure clearly shows it: the first blue line is connected by an arrow to the blue line numbered 3 and also to the red lines numbered 6 and 7. It is not paired with the blue line numbered 2 because they have the same x coordinate. I tried the following, but it does not do everything I need and I could not solve it:
ave_x = []
ave_y = []
ave_z = []
for ind, line in enumerate(all_line):
    x = (coord[line][0][0] + coord[line][1][0]) / 2
    ave_x.append(x)
    y = (coord[line][0][1] + coord[line][1][1]) / 2
    ave_y.append(y)
    z = (coord[line][0][2] + coord[line][1][2]) / 2
    ave_z.append(z)
avs = np.concatenate((ave_x, ave_y, ave_z), axis=0)
avs = avs.reshape(-1, len(ave_x))
avs_f = avs.T
blue_red = [len(blue_line), len(red_line)]
avs_split = np.split(avs_f, np.cumsum(blue_red))[:-1]  # first array holds centers of
# blue lines and the second holds centers of red lines
dists = []
for data in avs_split:
    for ind, val in enumerate(data):
        if ind < len(data):
            for ind in range(len(data) - 1):
                squared_dist = np.sum((data[ind] - data[ind+1])**2, axis=0)
                dists.append(squared_dist)
In fact, I expect my code to give me the resulting list of the groups of lines that create the three surfaces:
[(1, 6, 3, 7), (2, 7, 4, 8), (3, 9, 5, 10)]
At the end, I want to find the numbers of the lines which are not used in creating the surfaces, or which are used but are closer than a limit to the dashed line in my figure. I have the coordinates of the two points defining that dashed line:
coord_dash=np.array([[2., 2., 2.], [5., 0., 1.]])
adjacency_threshold=2
These line numbers are also:
[4, 10, 5, 11, 12]
In advance I do appreciate any help.
I'm not sure my answer is what you are looking for because your question is a bit unclear. To start off, I create the blue and red lines as dictionaries, where the keys are the line numbers and the values are tuples with the start and end point numbers. I also create a dictionary all_mid where the key is the line number and the value is a pandas Series with the coordinates of the midpoint.
import numpy as np
import pandas as pd

coord = np.array([[0.,0.,2.], [0.,1.,3.], [0.,2.,2.], [1.,0.,1.], [1.,1.,3.],
                  [1.,2.,1.], [2.,0.,1.], [2.,1.,1.], [3.,0.,1.], [4.,0.,1.]])
df = pd.DataFrame(
    data=sorted(coord, key=lambda item: (item[0], item[1], item[2])),
    columns=['x', 'y', 'z'],
    index=range(1, len(coord) + 1))

count = 1
blue_line = {}
for start, end in zip(df.index[:-1], df.index[1:]):
    if df.loc[start, 'x'] == df.loc[end, 'x']:
        blue_line[count] = (start, end)
        count += 1

red_line = []
index = df.sort_values('y').index
for start, end in zip(index[:-1], index[1:]):
    if df.loc[start, 'y'] == df.loc[end, 'y']:
        red_line.append((start, end))
red_line = {i + count: (start, end)
            for i, (start, end) in enumerate(sorted(red_line))}

all_line = {**blue_line, **red_line}
all_mid = {i: (df.loc[start] + df.loc[end]) / 2
           for i, (start, end) in all_line.items()}
The lines look like this:
In [875]: blue_line
Out[875]: {1: (1, 2), 2: (2, 3), 3: (4, 5), 4: (5, 6), 5: (7, 8)}
In [876]: red_line
Out[876]:
{6: (1, 4),
7: (2, 5),
8: (3, 6),
9: (4, 7),
10: (5, 8),
11: (7, 9),
12: (9, 10)}
Then I define some utility functions:
adjacent returns True if the input points are adjacent.
left_to_right returns True if the x coordinate of the first point is less than the x coordinate of the second point.
connections returns a dictionary in which the key is a line number and the value is a list with the line numbers connected to it.
def adjacent(p, q, threshold=1):
    dx = abs(p['x'] - q['x'])
    dy = abs(p['y'] - q['y'])
    dxy = np.sqrt(dx**2 + dy**2)
    return np.max([dx, dy, dxy]) <= threshold

def left_to_right(p, q):
    return p['x'] < q['x']

def connections(midpoints, it):
    mapping = {}
    for start, end in it:
        if adjacent(midpoints[start], midpoints[end]):
            if left_to_right(midpoints[start], midpoints[end]):
                if start in mapping:
                    if end not in mapping[start]:
                        mapping[start].append(end)
                else:
                    mapping[start] = [end]
    return mapping
We are now ready to create a list of lists, in which each sublist has the line numbers that make up a surface:
from itertools import product, combinations

blues = blue_line.keys()
reds = red_line.keys()
blue_to_red = connections(all_mid, product(blues, reds))
blue_to_blue = connections(all_mid, combinations(blues, r=2))
surfaces = []
for start in blue_line:
    red_ends = blue_to_red.get(start, [])
    blue_ends = blue_to_blue.get(start, [])
    if len(red_ends) == 2 and len(blue_ends) == 1:
        surfaces.append(sorted([start] + red_ends + blue_ends))
This is what you get:
In [879]: surfaces
Out[879]: [[1, 3, 6, 7], [2, 4, 7, 8], [3, 5, 9, 10]]
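This covers the "which lines form surfaces" part of the question. For the follow-up about lines that do not take part in any surface, a small sketch on top of this result could be (the closeness test against the dashed line would still need a separate point-to-segment distance check):
used = {line for surface in surfaces for line in surface}
unused = sorted(set(all_line) - used)
print(unused)   # [11, 12] for this example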

Conditional minimum of 4D matrix: find minimizing vectors

I'm looking for a way to conditionally minimize a 4D matrix.
Let's start creating some toy data (which is close to my real-world problem):
import numpy as np

t = np.arange(1960, 1981, 1)
N = np.arange(0, 3, 1)
k = np.arange(0, 5, 0.1)
k_matrix = (np.tile(k, (len(N), 1)).T * (N+1)/(N+2)).T
p = np.arange(0.1, 2.01, 0.1)
theory = np.random.normal(10, 1, [len(N), len(t), len(p)])
res2 = np.zeros([len(N), len(t), len(k), len(p)])

def calc_res2(N, t, k_matrix, p, theory):
    for N_ind, N_val in enumerate(N):
        for t_ind, t_val in enumerate(t):
            for k_ind, k_val in enumerate(k_matrix[N_ind]):
                for p_ind, p_val in enumerate(p):
                    res2[N_ind, t_ind, k_ind, p_ind] = (N_val*t_val - k_val*theory[N_ind, t_ind, p_ind])**2
    return res2

test = calc_res2(N, t, k_matrix, p, theory)
I want to find the indices/values of k_matrix (one per N) and p (one per t) such that the sum of test over t and N is minimal.
Now I see that this problem can be solved using nested for loops:
def k_multi_N(test, k_matrix, p):
    SUM_best = 1e99
    k0i_b, k1i_b, k2i_b = 0, 0, 0
    for k0_ind, k0 in enumerate(k_matrix[0]):
        temp = test[0, :, k0_ind, :].copy()  # copy so the slice of test is not modified in place
        for k1_ind, k1 in enumerate(k_matrix[1]):
            temp += test[1, :, k1_ind, :]
            for k2_ind, k2 in enumerate(k_matrix[2]):
                temp += test[2, :, k2_ind, :]
                SUM = sum(temp.min(axis=1))
                if SUM < SUM_best:
                    SUM_best = SUM
                    p_min_ind = np.argmin(temp, axis=1)
                    k0i_b, k1i_b, k2i_b = k0_ind, k1_ind, k2_ind
                temp -= test[2, :, k2_ind, :]
            temp -= test[1, :, k1_ind, :]
    return p_min_ind, (k0i_b, k1i_b, k2i_b)

k_multi_N(test, k_matrix, p)
So the expected output is:
(array([12, 16, 14, 8, 14, 18, 1, 18, 9, 9, 15, 18, 9, 13, 9, 3, 3,
18, 13, 6, 19]),
(0, 49, 49))
but the computational efficiency will be very poor for big N and k vectors (my real-world case is 16*200 for N*k and 800*200 for t*k, so that is 200^16 combinations, each evaluated with 800*200 matrices).
Of course, I considered a numba solution, but it does not allow me to significantly speed up the calculation (i.e. it still takes a lot of time).
I'm wondering about alternative, more computationally efficient ways to solve the problem.
Thanks!
EDIT: The question was significantly changed to clarify the problem. I appreciate the people who helped me to do it!
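For reference, the two inner Python loops of that brute force can be replaced with NumPy broadcasting. A minimal sketch (the function name is illustrative; it only helps at toy sizes like the example above, since the combinatorial growth over N remains):
import numpy as np

def k_multi_N_vec(test):
    # Same exhaustive search as k_multi_N, but the k1 and k2 loops are
    # replaced by broadcasting; only the k0 loop is kept to bound memory use.
    n_k = test.shape[2]
    t1 = test[1].transpose(1, 0, 2)              # shape (k, t, p)
    t2 = test[2].transpose(1, 0, 2)              # shape (k, t, p)
    pair = t1[:, None] + t2[None, :]             # shape (k1, k2, t, p)
    best, best_idx, p_min_ind = np.inf, None, None
    for k0 in range(n_k):
        temp = test[0][:, k0, :] + pair          # shape (k1, k2, t, p)
        score = temp.min(axis=-1).sum(axis=-1)   # min over p, sum over t
        k1, k2 = np.unravel_index(score.argmin(), score.shape)
        if score[k1, k2] < best:
            best = score[k1, k2]
            best_idx = (k0, k1, k2)
            p_min_ind = temp[k1, k2].argmin(axis=-1)
    return p_min_ind, best_idx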

How to generate random numbers to satisfy a specific mean and median in python?

I would like to generate n random numbers e.g., n=200, where the range of possible values is between 2 and 40 with a mean of 12 and median is 6.5.
I searched everywhere and I could not find a solution for this. I tried the following script, but it only works for small numbers such as 20; for big numbers it takes ages and never seems to return a result.
n = 200
x = np.random.randint(0, 1, size=n)  # initialisation only
while True:
    if x.mean() == 12 and np.median(x) == 6.5:
        break
    else:
        x = np.random.randint(2, 40, size=n)
Could anyone help me by improving this to get a quick result even when n=5000 or so?
One way to get a result really close to what you want is to generate two separate random arrays of length 100 each, which satisfy your median constraint and together cover the desired range of numbers. Then, by concatenating the arrays, the mean will be around 12 but not exactly 12. But since it is just the mean that you are dealing with, you can simply generate your expected result by tweaking one of these arrays.
In [162]: arr1 = np.random.randint(2, 7, 100)
In [163]: arr2 = np.random.randint(7, 40, 100)
In [164]: np.mean(np.concatenate((arr1, arr2)))
Out[164]: 12.22
In [166]: np.median(np.concatenate((arr1, arr2)))
Out[166]: 6.5
The following is a vectorized and heavily optimized solution, compared to any solution that uses for loops or Python-level code to constrain the random sequence creation:
import numpy as np
import math
def gen_random():
    arr1 = np.random.randint(2, 7, 99)
    arr2 = np.random.randint(7, 40, 99)
    mid = [6, 7]
    i = ((np.sum(arr1 + arr2) + 13) - (12 * 200)) / 40
    decm, intg = math.modf(i)
    args = np.argsort(arr2)
    arr2[args[-41:-1]] -= int(intg)
    arr2[args[-1]] -= int(np.round(decm * 40))
    return np.concatenate((arr1, mid, arr2))
Demo:
arr = gen_random()
print(np.median(arr))
print(arr.mean())
6.5
12.0
The logic behind the function:
In order for us to have a random array with those criteria, we can concatenate 3 arrays together: arr1, mid and arr2. arr1 and arr2 each hold 99 items and mid holds the 2 items 6 and 7, so that the final result gives 6.5 as the median. Now we can create two random arrays, each with length 99. All we need to do to give the result a mean of 12 is to find the difference between the current sum and 12 * 200 and subtract it from our largest numbers, which in this case we choose from arr2 (the code spreads the correction over its 40 largest values).
Edit:
If it's not a problem to have float numbers in your result, you can actually shorten the function as follows:
import numpy as np
import math

def gen_random():
    arr1 = np.random.randint(2, 7, 99).astype(float)
    arr2 = np.random.randint(7, 40, 99).astype(float)
    mid = [6, 7]
    i = ((np.sum(arr1 + arr2) + 13) - (12 * 200)) / 40
    args = np.argsort(arr2)
    arr2[args[-40:]] -= i
    return np.concatenate((arr1, mid, arr2))
Here, you want a median value less than the mean value. That means that a uniform distribution is not appropriate: you want many small values and fewer large ones.
Specifically, you want as many values less than or equal to 6 as there are values greater than or equal to 7.
A simple way to ensure that the median will be 6.5 is to have the same number of values in the range [2 - 6] as in [7 - 40]. If you chose uniform distributions in both ranges, you would have a theoretical mean of 13.75, which is not that far from the required 12.
A slight variation on the weights can make the theoretical mean even closer: if we use [5, 4, 3, 2, 1, 1, ..., 1] for the relative weights of the random.choices of the [7, 8, ..., 40] range, we find a theoretical mean of 19.98 for that range, which is close enough to the expected 20.
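As a quick sanity check of that 19.98 figure, here is a minimal sketch recomputing the weighted mean of the upper range with those weights:
pop2 = list(range(7, 41))
w2 = [5, 4, 3, 2] + [1] * 30
print(sum(v * w for v, w in zip(pop2, w2)) / sum(w2))   # ~19.98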
Example code:
>>> pop1 = list(range(2, 7))
>>> pop2 = list(range(7, 41))
>>> w2 = [ 5, 4, 3, 2 ] + ( [1] * 30)
>>> r1 = random.choices(pop1, k=2500)
>>> r2 = random.choices(pop2, w2, k=2500)
>>> r = r1 + r2
>>> random.shuffle(r)
>>> statistics.mean(r)
12.0358
>>> statistics.median(r)
6.5
>>>
So we now have a 5000-value distribution that has a median of exactly 6.5 and a mean value of 12.0358 (this one is random, and another test will give a slightly different value). If we want an exact mean of 12, we just have to tweak some values. Here sum(r) is 60179 when it should be 60000, so we have to decrease 179 values which are neither 2 (that would go out of range) nor 7 (that would change the median).
In the end, a possible generator function could be:
def gendistrib(n):
    if n % 2 != 0:
        raise ValueError("gendistrib needs an even parameter")
    n2 = n // 2                          # n / 2 in Python 2
    pop1 = list(range(2, 7))             # lower range
    pop2 = list(range(7, 41))            # upper range
    w2 = [5, 4, 3, 2] + ([1] * 30)       # weights for upper range
    r1 = random.choices(pop1, k=n2)      # lower part of the distrib.
    r2 = random.choices(pop2, w2, k=n2)  # upper part
    r = r1 + r2
    random.shuffle(r)                    # randomize order
    # time to force an exact mean
    tot = sum(r)
    expected = 12 * n
    if tot > expected:                   # too high: decrease some values
        for i, val in enumerate(r):
            if val != 2 and val != 7:
                r[i] = val - 1
                tot -= 1
                if tot == expected:
                    random.shuffle(r)    # shuffle again the decreased values
                    break
    elif tot < expected:                 # too low: increase some values
        for i, val in enumerate(r):
            if val != 6 and val != 40:
                r[i] = val + 1
                tot += 1
                if tot == expected:
                    random.shuffle(r)    # shuffle again the increased values
                    break
    return r
It is really fast: I could timeit gendistrib(10000) at less than 0.02 seconds. But it should not be used for small distributions (fewer than 1000 values).
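Usage is then simply (a minimal sketch; the median comes out as 6.5 in practice because the lower and upper halves never cross the 6/7 boundary):
import random
import statistics

r = gendistrib(5000)
print(statistics.mean(r))     # 12
print(statistics.median(r))   # 6.5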
OK, you're looking at a distribution which has no fewer than 4 parameters: two of them defining the range and two responsible for the required mean and median.
I could think about two possibilities from the top of my head:
Truncated normal distribution, look here for details. You already have the range defined, and have to recover μ and σ from the mean and median. It will require solving a couple of nonlinear equations, but that is quite doable in Python. Sampling could be done using https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.truncnorm.html
4-parameter Beta distribution, see here for details. Again, recovering α and β of the Beta distribution from the mean and median will require solving a couple of nonlinear equations. Knowing them, sampling would be easy via https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.beta.html
UPDATE
Here is how you could do it for the truncated normal, going from mean to μ: Truncated normal with a given mean
If you have a bunch of smaller arrays with the right median and mean, you can combine them to produce a larger array.
So... you can pre-generate smaller arrays as you are currently doing, and then combine them randomly for larger n. Of course, this will result in a biased random sample, but it sounds like you just want something that's approximately random.
Here's working (py3) code that generates a sample of size 5000 with your desired properties, which it builds from smaller samples of sizes 4, 6, 8, 10, ..., 18.
Note that I changed how the smaller random samples are built: half of the numbers must be <= 6 and half >= 7 if the median is to be 6.5, so we generate those halves independently. This speeds things up massively.
import collections
import numpy as np
import random

rs = collections.defaultdict(list)
for i in range(50):
    n = random.randrange(4, 20, 2)
    while True:
        x = np.append(np.random.randint(2, 7, size=n//2),
                      np.random.randint(7, 41, size=n//2))
        if x.mean() == 12 and np.median(x) == 6.5:
            break
    rs[len(x)].append(x)

def random_range(n):
    if n % 2:
        raise AssertionError("%d must be even" % n)
    r = []
    while n:
        i = random.randrange(4, min(20, n+1), 2)
        # Don't be left with only 2 slots left.
        if n - i == 2:
            continue
        xs = random.choice(rs[i])
        r.extend(xs)
        n -= i
    random.shuffle(r)
    return r

xs = np.array(random_range(5000))
print([(i, list(xs).count(i)) for i in range(2, 41)])
print(len(xs))
print(xs.mean())
print(np.median(xs))
Output:
[(2, 620), (3, 525), (4, 440), (5, 512), (6, 403), (7, 345), (8, 126), (9, 111), (10, 78), (11, 25), (12, 48), (13, 61), (14, 117), (15, 61), (16, 62), (17, 116), (18, 49), (19, 73), (20, 88), (21, 48), (22, 68), (23, 46), (24, 75), (25, 77), (26, 49), (27, 83), (28, 61), (29, 28), (30, 59), (31, 73), (32, 51), (33, 113), (34, 72), (35, 33), (36, 51), (37, 44), (38, 25), (39, 38), (40, 46)]
5000
12.0
6.5
The first line of the output shows that there are 620 2's, 525 3's, 440 4's, etc. in the final array.
While this post already has an accepted answer, I'd like to contribute a general non-integer approach. It does not need loops or testing. The idea is to take a PDF with compact support. Following the idea of the accepted answer of Kasrâmvd, make two distributions, one on the left interval and one on the right. Choose shape parameters such that the mean falls on the given value. The interesting opportunity here is that one can create a continuous PDF, i.e. without jumps where the intervals join.
As an example I have chosen the beta distribution. To have finite non-zero values at the border I've chosen beta = 1 for the left and alpha = 1 for the right.
Looking at the definition of the PDF, the continuity requirement and the required mean give two equations:
4.5 / alpha = 33.5 / beta
2 + 4.5 * alpha / ( alpha + 1 ) + 6.5 + 33.5 * 1 / ( 1 + beta ) = 24
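Eliminating beta via the first relation (beta = (33.5 / 4.5) * alpha = 67 * alpha / 9) and clearing denominators turns the mean condition into the quadratic 737 * alpha**2 + 836 * alpha - 162 = 0, whose positive root is alpha = (sqrt(294118) - 418) / 737; then beta = 67 * alpha / 9 = (sqrt(294118) - 418) / 99. These are the s and t used in the code below.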
This quadratic is easy to solve; then just use scipy.stats.beta:
from scipy.stats import beta
import matplotlib.pyplot as plt
import numpy as np

x1 = np.linspace(2, 6.5, 200)
x2 = np.linspace(6.5, 40, 200)

# I use s and t instead of alpha and beta
s = 1. / 737 * (np.sqrt(294118) - 418)
t = 1. / 99 * (np.sqrt(294118) - 418)

data1 = beta.rvs(s, 1, loc=2, scale=4.5, size=20000)
data2 = beta.rvs(1, t, loc=6.5, scale=33.5, size=20000)
data = np.concatenate((data1, data2))

print(np.mean(data1), 2 + 4.5 * s / (1. + s))
print(np.mean(data2), 6.5 + 33.5 / (1. + t))
print(np.mean(data))
print(np.median(data))

fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.hist(data1, bins=13, density=True)
ax.hist(data2, bins=67, density=True)
ax.plot(x1, beta.pdf(x1, s, 1, loc=2, scale=4.5))
ax.plot(x2, beta.pdf(x2, 1, t, loc=6.5, scale=33.5))
ax.set_yscale('log')
plt.show()
provides
>> 2.661366939244768 2.6495436216856976
>> 21.297348804473618 21.3504563783143
>> 11.979357871859191
>> 6.5006779033245135
so the results are as required.
