How can I reproduce the following probability function in Python? - python

I have a task where I have a list of certain values: l = ["alpha", "beta", "beta", "alpha", "gamma", "alpha", "alpha"]. I have a formula for computing a kind of probability on this list (the probability is high when there are many different values in the list and low when there are only a few kinds of values):
$ p = - \sum_{i=1}^{m} f_i \log_m f_i $
where $m$ is the length of the list and $f_i$ is the frequency of the $i$-th element of the list.
I want to code this in Python with the following:
from math import log
from collections import Counter
-sum([loc*log(loc, len(set(l))) for loc in Counter(l).values()])
But I somehow suspect that this is not the right way. Any better idea?
Additionally, I do not understand the negative sign in the formula; what is the explanation for it?

Here is an alternative way to calculate the entropy of the list using numpy:
import numpy as np

l = ["alpha", "beta", "beta", "alpha", "gamma", "alpha", "alpha"]
arr = np.array(l)
elem, c = np.unique(arr, return_counts=True)
# occurrences to probabilities
pc = c / c.sum()
# calculate the entropy (dividing by log(len(c)) accounts for the log_m base)
entropy = -np.sum(pc * np.log(pc)) / np.log(len(c))

Although the numpy approach is a better solution, in case you don't want to use numpy:
You would be better off saving the Counter and using its length instead of len(set(l)), so that the set is not rebuilt in every iteration. len(Counter(l)) is the same as len(set(l)), but a saved Counter is built only once (I assume you use CPython 3.x).
If you don't get the desired result, then your formula is probably wrong.
In your code you use len(set(l)) rather than len(l), and you iterate over the frequencies rather than over the list, which is not what your formula describes.
You don't need to wrap the expression inside sum in a list, since you only iterate over it once (generator expressions vs. list comprehensions).
EDIT: As to why you get a negative result, this is expected:
You sum terms of the form f[i] * log(f[i]), each of which is >= 0, because:
f[i] >= 1: the frequency of the ith element of the list.
log(f[i]) >= 0 because f[i] >= 1: the log of each frequency is non-negative in any base (the base doesn't matter).
You then take the negative of that sum, so the result will always be less than or equal to 0.
from math import log
from collections import Counter
l = ["alpha", "beta", "beta", "alpha", "gamma", "alpha", "alpha"]
f = Counter(l)
# This is from your code
p1 = -sum(f[e] * log(f[e], len(f)) for e in f)
# This is from your formula
p2 = -sum(f[e] * log(f[e], len(l)) for e in l)
print(p1, p2)

Related

Fastest way to sample most numbers with minimum difference larger than a value from a Python list

Given a list of 20 float numbers, I want to find the largest subset in which any two of the chosen numbers differ from each other by more than mindiff = 1.0. Right now I am using a brute-force method, searching from the largest to the smallest subsets using itertools.combinations. As shown below, the code finds a subset after about 4 s for a list of 20 numbers.
from itertools import combinations
import random
from time import time
mindiff = 1.
length = 20
random.seed(99)
lst = [random.uniform(1., 10.) for _ in range(length)]
t0 = time()
n = len(lst)
sample = []
found = False
while not found:
    # get all subsets with size n
    subsets = list(combinations(lst, n))
    # shuffle to ensure randomness
    random.shuffle(subsets)
    for subset in subsets:
        # sort the subset numbers
        ss = sorted(subset)
        # calculate the differences between every two adjacent numbers
        diffs = [j-i for i, j in zip(ss[:-1], ss[1:])]
        if min(diffs) > mindiff:
            sample = set(subset)
            found = True
            break
    # check subsets with size n-1
    n -= 1
print(sample)
print(time()-t0)
Output:
{2.3704888087015568, 4.365818049020534, 5.403474619948962, 6.518944556233767, 7.8388969285727015, 9.117993839791751}
4.182451486587524
However, in reality I have a list of 200 numbers, which is infeasible for a brute-force enumeration. I want a fast algorithm to sample just one random largest subset with a minimum difference larger than 1. Note that I want each sample to have randomness and maximum size. Any suggestions?
My previous answer assumed you simply wanted a single optimal solution, not a uniform random sample of all solutions. This answer samples uniformly from all such optimal (maximum-size) solutions.
Construct a directed acyclic graph G where there is one node for each point, and nodes a and b are connected when b - a > mindist. Also add two virtual nodes, s and t, where s -> x for all x and x -> t for all x.
Calculate for each node in G how many paths of length k exist to t. You can do this efficiently in O(n^2 k) time using dynamic programming with a table P[x][k], filling initially P[x][0] = 0 except P[t][0] = 1, and then P[x][k] = sum(P[y][k-1] for y in neighbors(x)).
Keep doing this until you reach the maximum k - you now know the size of the optimal subset.
Uniformly sample a path of length k from s to t using P to weight your choices.
This is done by starting at s. We then look at each neighbor y of s and choose one randomly, weighting each neighbor y by P[y][k-1]. This gives us our first element of the optimal set.
We then repeatedly perform this step. When we are at x, we look at the neighbors of x and pick one randomly using weights P[y][k-i] for each neighbor y, where i is the step we're at.
Use the nodes you sampled in the previous steps as your random subset.
An implementation of the above in pure Python:
import random
def sample_mindist_subset(xs, mindist):
    # Construct directed graph G.
    n = len(xs)
    s = n; t = n + 1  # Two virtual nodes, source and sink.
    neighbors = {
        i: [t] + [j for j in range(n) if xs[j] - xs[i] > mindist]
        for i in range(n)}
    neighbors[s] = [t] + list(range(n))
    neighbors[t] = []

    # Compute number of paths P[x][k] from x to t of length k.
    P = [[0 for _ in range(n+2)] for _ in range(n+2)]
    P[t][0] = 1
    for k in range(1, n+2):
        for x in range(n+2):
            P[x][k] = sum(P[y][k-1] for y in neighbors[x])

    # Sample maximum length path uniformly at random.
    maxk = max(k for k in range(n+2) if P[s][k] > 0)
    path = [s]
    while path[-1] != t:
        candidates = neighbors[path[-1]]
        weights = [P[cn][maxk-len(path)] for cn in candidates]
        path.append(random.choices(candidates, weights)[0])
    return [xs[i] for i in path[1:-1]]
Note that if you want to sample from the same set of numbers many times, you don't have to recompute P every single time and can re-use it.
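For example, the function above could be split into a precompute step and a cheap per-sample step; this is only a sketch, and the helper names below are illustrative:
import random

def precompute_paths(xs, mindist):
    # Same graph and path-count table P as in sample_mindist_subset above.
    n = len(xs)
    s, t = n, n + 1
    neighbors = {
        i: [t] + [j for j in range(n) if xs[j] - xs[i] > mindist]
        for i in range(n)}
    neighbors[s] = [t] + list(range(n))
    neighbors[t] = []
    P = [[0] * (n + 2) for _ in range(n + 2)]
    P[t][0] = 1
    for k in range(1, n + 2):
        for x in range(n + 2):
            P[x][k] = sum(P[y][k - 1] for y in neighbors[x])
    maxk = max(k for k in range(n + 2) if P[s][k] > 0)
    return neighbors, P, maxk, s, t

def sample_precomputed(xs, neighbors, P, maxk, s, t):
    # Only the weighted random walk is repeated per sample.
    path = [s]
    while path[-1] != t:
        candidates = neighbors[path[-1]]
        weights = [P[c][maxk - len(path)] for c in candidates]
        path.append(random.choices(candidates, weights)[0])
    return [xs[i] for i in path[1:-1]]

# Pay the dynamic-programming cost once, then draw many samples cheaply:
# pre = precompute_paths(lst, 1.0)
# samples = [sample_precomputed(lst, *pre) for _ in range(10)]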
I probably don't fully understand the question, because right now the solution is quite trivial. EDIT: yes, I misunderstood after all, the OP does not just want an optimal solution, but wishes to randomly sample from the set of optimal solutions. This answer is not incorrect but it also is an answer to a different question than what OP is interested in.
Simply sort the numbers and greedily construct the subset:
def mindist_subset(xs, mindist):
    result = []
    for x in sorted(xs):
        if not result or x - result[-1] > mindist:
            result.append(x)
    return result
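For example, with some made-up values:
print(mindist_subset([1.0, 1.5, 3.0, 4.2, 5.0], 1.0))
# [1.0, 3.0, 4.2] -- each kept value is more than 1.0 above the previously kept one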
Sketch of proof of correctness.
Suppose we have a solution S of optimal size for input array A. If it does not contain min(A), note that we could remove min(S) from S and add min(A), since this would only increase the distance between the smallest element and the second smallest element of S. Conclusion: we can, without loss of generality, assume that min(A) is part of an optimal solution.
Now we can apply this argument recursively. We add min(A) to the solution and remove all elements too close to min(A), giving the remaining elements A'. Then we're left with a subproblem where exactly the same argument applies: we can choose min(A') as the next element of the solution, and so on.

Adding ranges to count overlap python

I have a list of ranges. I would now like to compute a dictionary of key : value pairs, where key is a number and value is the count of ranges that contain that number.
A bad way to compute this is:
from collections import defaultdict
my_dict = defaultdict(int)
ranges = [range(-4200,4200), range(-420,420), range(-42,42), range(8,9), range(9,9), range(9,10)]
for singleRange in ranges:
    for number in singleRange:
        my_dict[number] += 1
sort_dict = sorted(my_dict.items(), key=lambda x: x[1], reverse=True)
print(sort_dict)
How would you do this more efficiently?
Improving on my previous answer, this algorithm solves the problem in O(n + m) where n is the length of the total range and m is the number of sub ranges.
The basic idea is to iterate through the n numbers just once, keeping a counter of the number of ranges the current number belongs to. At each step, we check if we have passed a range start, in which case the counter gets incremented. Conversely, if we have passed a range stop, the counter gets decremented.
The actual implementation below uses numpy and pandas for all the heavy lifting, so the iterative nature of the algorithm may seem unclear, but it's basically just a vectorized version of what I've described.
Compared to the 600 ms of my previous answer, we're down to 20 ms for 10k ranges on my laptop. Moreover, the memory usage is also O(n + m) here while it was O(nm) there, so much larger n and m become possible. You should probably use this solution instead of the first version.
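For reference, the same sweep can be sketched in plain Python (the helper name count_overlaps is just for illustration); the vectorized implementation follows below:
def count_overlaps(ranges):
    # +1 where a range starts, -1 where it stops.
    delta = {}
    for r in ranges:
        delta[r.start] = delta.get(r.start, 0) + 1
        delta[r.stop] = delta.get(r.stop, 0) - 1
    # Walk the integers once, keeping a running count of open ranges.
    counts = {}
    running = 0
    for n in range(min(delta), max(delta)):
        running += delta.get(n, 0)
        if running:
            counts[n] = running
    return counts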
from collections import defaultdict
import numpy as np
import pandas as pd
# Generate data
def generate_ranges(n):
    boundaries = np.random.randint(-10_000, 10_000, size=(n, 2))
    boundaries.sort(axis=1)
    return [range(x, y) for x, y in boundaries]
ranges = generate_ranges(10_000)
# Extract boundaries
boundaries = np.array([[range.start, range.stop] for range in ranges])
# Add a +1 offset for range starts and -1 for range stops
offsets = np.array([1, -1])[None, :].repeat(boundaries.shape[0], axis=0)
boundaries = np.stack([boundaries, offsets], axis=-1)
boundaries = boundaries.reshape(-1, 2)
# Compute range counts at each crossing of a range boundary
df = pd.DataFrame(boundaries, columns=["n", "offset"])
df = df.sort_values("n")
df["count"] = df["offset"].cumsum()
df = df.groupby("n")["count"].max()
# Expand to all integers by joining and filling NaN
index = pd.RangeIndex(df.index[0], df.index[-1] + 1)
df = pd.DataFrame(index=index).join(df).fillna(method="ffill")
# Finally wrap the result in a defaultdict
d = defaultdict(int, df["count"].astype(int).to_dict())
Probably something more efficient can be done, but this solution has the advantage of heavily relying on the speed of numpy. For 10k ranges this runs in ~600 ms on my laptop.
from collections import defaultdict
import numpy as np
# Generate data
def generate_ranges(n):
    boundaries = np.random.randint(-10_000, 10_000, size=(n, 2))
    boundaries.sort(axis=1)
    return [range(x, y) for x, y in boundaries]
ranges = generate_ranges(10_000)
# Extract boundaries
starts, stops = np.array([[range.start, range.stop] for range in ranges]).T
# Set of all numbers we should test
n = np.arange(starts.min(), stops.max() + 1)[:, None]
# Test those numbers
counts = ((n >= starts[None, :]) & (n < stops[None, :])).sum(axis=1)
# Wrap the result into a dict
d = defaultdict(int, dict(zip(n.flatten(), counts)))

Generate a list of 100 elements, with each element having a 50% chance of being 0, and a 50% chance of being a random number between 0 and 1

I am quite new to this and I am trying to learn on my own. As I said in the title, I am trying to create a list of 100 numbers in which each element has a 50% chance of being 0 and a 50% chance of being a random number between 0 and 1. I made it like the code below. It works, but it is a very tedious and not well coded program. Any hints on how to make it better?
import random
import numpy as np
#define a list of 100 random numbers between 0 and 1
randomlist = []
for i in range(0,100):
    n = random.uniform(0,1)
    randomlist.append(n)
print(randomlist)
#create a list of 100 numbers of 0's and 1's
def random_binary_string(length):
    sample_values = '01' # pool of strings
    result_str = ''.join(random.choice(sample_values) for i in range(length))
    return result_str
l=100
x=random_binary_string(l)
x1=np.array(list(map(int, x)))
print(x1)
#combine both lists. Keep the value of the binary list if it is equal to zero. Else, substitute it with the value of randomlist
#at the same index position
finalist=[]
for i in range(len(x1)):
    if x1[i]==0:
        finalist.append(x1[i])
    else:
        finalist.append(randomlist[i])
print(finalist)
Thanks a lot!
You can simplify your code by nesting the two conditions. This avoids the need to keep two separate lists in memory and then merge them at the end.
randomlist = []
for i in range(0,100):
    if random.choice((0, 1)) == 1:
        randomlist.append(random.uniform(0,1))
    else:
        randomlist.append(0)
This is simple and succinct enough that you can refactor it to a single list comprehension. This is more compact but somewhat less legible.
randomlist = [random.uniform(0,1) if random.choice((0, 1)) else 0 for i in range(0,100)]
Here, we also shorten the code slightly by exploiting the fact that 0 is falsey and 1 is truthy in Python; i.e. they evaluate to False and True, respectively, in a boolean context. So if random.choice((0, 1)) == 1 can be abbreviated to simply if random.choice((0, 1)).
Somewhat obscurely, you can further simplify this (in the sense of using less code) by observing that the expression B if A else A can be short-circuited into the expression A and B. This is not very obvious if you are not familiar with boolean logic, but I think you can work it out on paper.
randomlist = [random.choice((0, 1)) and random.uniform(0,1) for i in range(0,100)]
Demo: https://ideone.com/uGHS2Y
You could try doing something like this:
import random
def create_random_list():
    random_list = list()
    for _ in range(100):
        if random.choice((True, False)):
            random_list.append(0)
        else:
            random_list.append(random.uniform(0, 1))
    return random_list
randomly_generated_list = create_random_list()
print(len(randomly_generated_list), randomly_generated_list)
# 100 [x_0,...,x_99]
I propose this method:
first generate a random list of 'A' and 'B' with random.choice: 50% 'A' and 50% 'B'
then replace each 'A' by a random number between 0 and 1
and replace each 'B' by 0
code here:
import random
ll = [ random.choice(['A', 'B']) for x in range(200)]
print(ll, len(ll))
for i in range(len(ll)):
    if ll[i] == 'A':
        ll[i] = random.random()
    else:
        ll[i] = 0
print(ll, len(ll))
Shorter code here:
import random
ll = [ random.choice([0, random.random()]) for x in range(200)]
print(ll, len(ll), ll.count(0))
Since you are using Numpy, I probably would do as follows:
Create the array of num_el elements using random.uniform.
Consider whether the excluded upper bound is a problem: the sampling interval is [low, high).
Create a boolean matrix with probability p=0.5 between true and false, using random.choice.
Use the matrix to set some elements of the array to zero, via boolean indexing.
That's the code:
import numpy as np

num_el = 10
p = 0.5
res = np.random.uniform(0., 1., size=(1, num_el))
bool_mat = np.random.choice(a=[False, True], size=(1, num_el), p=[p, 1-p])
res[bool_mat] = 0.
res
# array([[0. , 0.51213168, 0. , 0.68230528, 0.5287728 ,
# 0.9072587 , 0. , 0.43078057, 0.89735872, 0. ]])
The approach to use depends on whether your objective is to get exactly half of the outcomes to be zeroes, or have the expected number of zeros be half the total. It wasn't clear from your question which way you viewed the problem, so I've implemented both approaches as functions.
If you want a deterministic fixed proportion of zeroes/non-zeroes, the first function in the code below will do the trick. It creates a list with the desired number of zeros and non-zeros, and then uses shuffling (which I timed to be faster than sampling). If you want exactly half, then obviously the argument n has to be even.
If your goal is a probabilistic 50% zeroes, use the second function.
import random
# Exactly floor(n / 2) outcomes are zeros, i.e., exactly half when n is even.
# This version is trivial to modify to give any desired proportion of zeros.
def make_rand_list_v1(n = 100):
    m = n // 2
    n -= m
    ary = [random.random() for _ in range(n)] + [0] * m
    random.shuffle(ary)
    return ary
# Each outcome has probability 0.5 of being zero
def make_rand_list_v2(n = 100):
    return [random.getrandbits(1) and random.uniform(0, 1) for _ in range(n)]
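For example, a quick sanity check of the two variants might look like this:
sample_v1 = make_rand_list_v1(100)
sample_v2 = make_rand_list_v2(100)
print(sample_v1.count(0))   # exactly 50 zeros by construction
print(sample_v2.count(0))   # around 50 zeros, varying from run to run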

Random contiguous slice of list in Python based on a single random integer

Using a single random number and a list, how would you return a random slice of that list?
For example, given the list [0,1,2] there are seven possibilities of random contiguous slices:
[ ]
[ 0 ]
[ 0, 1 ]
[ 0, 1, 2 ]
[ 1 ]
[ 1, 2]
[ 2 ]
Rather than getting a random starting index and a random end index, there must be a way to generate a single random number and use that one value to figure out both starting index and end/length.
I need it that way, to ensure these 7 possibilities have equal probability.
Simply fix one order in which you would sort all possible slices, then work out a way to turn an index in that list of all slices back into the slice endpoints. For example, the order you used could be described by
The empty slice is before all other slices
Non-empty slices are ordered by their starting point
Slices with the same starting point are ordered by their endpoint
So the index 0 should return the empty list. Indices 1 through n should return [0:1] through [0:n]. Indices n+1 through n+(n-1)=2n-1 would be [1:2] through [1:n]; 2n through n+(n-1)+(n-2)=3n-3 would be [2:3] through [2:n] and so on. You see a pattern here: the last index for a given starting point is of the form n+(n-1)+(n-2)+(n-3)+…+(n-k), where k is the starting index of the sequence. That's an arithmetic series, so that sum is (k+1)(2n-k)/2=(2n+(2n-1)k-k²)/2. If you set that term equal to a given index, and solve that for k, you get some formula involving square roots. You could then use the ceiling function to turn that into an integral value for k corresponding to the last index for that starting point. And once you know k, computing the end point is rather easy.
But the quadratic equation in the solution above makes things really ugly. So you might be better off using some other order. Right now I can't think of a way which would avoid such a quadratic term. The order Douglas used in his answer doesn't avoid square roots, but at least his square root is a bit simpler due to the fact that he sorts by end point first. The order in your question and my answer is called lexicographical order, his would be called reverse lexicographical and is often easier to handle since it doesn't depend on n. But since most people think about normal (forward) lexicographical order first, this answer might be more intuitive to many and might even be the required way for some applications.
Here is a bit of Python code which lists all sequence elements in order, and does the conversion from index i to endpoints [k:m] the way I described above:
from math import ceil, sqrt
n = 3
print("{:3} []".format(0))
for i in range(1, n*(n+1)//2 + 1):
    b = 1 - 2*n
    c = 2*(i - n) - 1
    # solve k^2 + b*k + c = 0
    k = int(ceil((- b - sqrt(b*b - 4*c))/2.))
    m = k + i - k*(2*n-k+1)//2
    print("{:3} [{}:{}]".format(i, k, m))
The - 1 term in c doesn't come from the mathematical formula I presented above. It's more like subtracting 0.5 from each value of i. This ensures that even if the result of sqrt is slightly too large, you won't end up with a k which is too large. So that term accounts for numeric imprecision and should make the whole thing pretty robust.
The term k*(2*n-k+1)//2 is the last index belonging to starting point k-1, so i minus that term is the length of the subsequence under consideration.
You can simplify things further. You can perform some computation outside the loop, which might be important if you have to choose random sequences repeatedly. You can divide b by a factor of 2 and then get rid of that factor in a number of other places. The result could look like this:
from math import ceil, sqrt
n = 3
b = n - 0.5
bbc = b*b + 2*n + 1
print("{:3} []".format(0))
for i in range(1, n*(n+1)//2 + 1):
    k = int(ceil(b - sqrt(bbc - 2*i)))
    m = k + i - k*(2*n-k+1)//2
    print("{:3} [{}:{}]".format(i, k, m))
It is a little strange to give the empty list equal weight with the others. It is more natural for the empty list to be given weight 0 or n+1 times the others, if there are n elements on the list. But if you want it to have equal weight, you can do that.
There are n*(n+1)/2 nonempty contiguous sublists. You can specify these by the end point, from 0 to n-1, and the starting point, from 0 to the endpoint.
Generate a random integer x from 0 to n*(n+1)/2.
If x=0, return the empty list. Otherwise, x is uniformly distributed from 1 through n(n+1)/2.
Compute e = floor(sqrt(2*x)-1/2). This takes the values 0, 1, 1, 2, 2, 2, 3, 3, 3, 3, etc.
Compute s = (x-1) - e*(e+1)/2. This takes the values 0, 0, 1, 0, 1, 2, 0, 1, 2, 3, ...
Return the interval starting at index s and ending at index e.
(s,e) takes the values (0,0),(0,1),(1,1),(0,2),(1,2),(2,2),...
import random
import math
n=10
x = random.randint(0, n*(n+1)//2)
if x == 0:
    print(range(n)[0:0])  # empty set
    exit()
e = int(math.floor(math.sqrt(2*x) - 0.5))
s = int(x - 1 - (e*(e+1)//2))
print(range(n)[s:e+1])  # starting at s, ending at e, inclusive
First create all possible slice index pairs.
[0:0], [1:1], etc. are equivalent (they all give the empty slice), so we include only one of them.
Finally you pick a random index couple and apply it.
import random
l = [0, 1, 2]
combination_couples = [(0, 0)]
length = len(l)
# Creates all index couples.
for j in range(1, length+1):
    for i in range(j):
        combination_couples.append((i, j))
print(combination_couples)
rand_tuple = random.sample(combination_couples, 1)[0]
final_slice = l[rand_tuple[0]:rand_tuple[1]]
print(final_slice)
To ensure we got them all:
for i in combination_couples:
    print(l[i[0]:i[1]])
Alternatively, with some math...
For a length-3 list the possible index numbers are 0 to 3, that is n=4. You choose 2 of them, that is k=2. The first index has to be smaller than the second, therefore we need to calculate the combinations as described here.
from math import factorial as f
def total_combinations(n, k=2):
    result = 1
    for i in range(1, k+1):
        result *= n - k + i
    result //= f(k)
    # We add plus 1 since we included [0:0] as well.
    return result + 1
print(total_combinations(n=4)) # Prints 7 as expected.
there must be a way to generate a single random number and use that one value to figure out both starting index and end/length.
It is difficult to say which method is best, but if you're only interested in binding a single random number to your contiguous slice you can use modulo.
Given a list l and a single random number r you can get your contiguous slice like this:
l[r % len(l) : some_sparkling_transformation(r) % len(l)]
where some_sparkling_transformation(r) is essential. It depends on your needs, but since I don't see any special requirements in your question it could be, for example:
l[r % len(l) : (2 * r) % len(l)]
The most important thing here is that both the left and right edges of the slice are correlated to r. This makes it a problem to define contiguous slices that won't follow any observable pattern. The above example (with 2 * r) produces slices that are always empty lists or follow the pattern [a : 2 * a].
Let's use some intuition. We know that we want to find a good random representation of the number r in the form of a contiguous slice. It turns out that we need to find two numbers: a and b, which are respectively the left and right edges of the slice. Assuming that r is a good random number (we like it in some way), we can say that a = r % len(l) is a good approach.
Let's now try to find b. The best way to generate another nice random number is to use a random number generator (random or numpy) which supports seeding (both of them do). Example with the random module:
import random
def contiguous_slice(l, r):
    random.seed(r)
    a = int(random.uniform(0, len(l)+1))
    b = int(random.uniform(0, len(l)+1))
    a, b = sorted([a, b])
    return l[a:b]
Good luck and have fun!

python function to determine how statistically likely it is that a number belongs to a given list of numbers

I am trying to find a function (in Python, ideally) that will tell me how 'similar' a number is to a given list of numbers. The end goal is to find out which list a given number is more likely to be a member of.
For example, take the two lists:
a = [5,4,8,3,6,4,7,2]
b = [9,5,14,10,11,18,9]
the function should take a new number and tell me how similar it is to a given list. For example, let's assume a hypothetical 'isSimilar' function will return a percentage chance that a number could be a member of a provided list:
# 5 looks pretty similar to list 'a' but not list 'b'.
>>> print isSimilar(a,5)
.9
>>> print isSimilar(b,5)
.5
# 15 looks more similar to list 'b'
>>> print isSimilar(a,15)
.4
>>> print isSimilar(b,15)
.8
# 10 looks like it has roughly the same chance to be in both lists
>>> print isSimilar(a,10)
.41
>>> print isSimilar(b,10)
.5
Ideally this hypothetical function would take the standard deviation of the lists into consideration. So, for example, in the following two lists:
a = [5,6,4,5]
b = [1,9,2,8]
the number '5' is more 'similar' to list 'a' than 'b' because the std deviation of the numbers in 'a' is much smaller.
Any help pointing me in the right direction would be much appreciated.
How about using an estimated pdf for both sets?
def get_most_likely_distribution_membership(value, d1, d2):
    nparam_density1 = stats.kde.gaussian_kde(d1)  # can use a different kernel
    nparam_density2 = stats.kde.gaussian_kde(d2)
    x = np.linspace(-20, 30, 200)  # maybe pre-define a range
    nparam_density1 = nparam_density1(x)
    nparam_density2 = nparam_density2(x)
    assert d1 != d2
    if nparam_density1[np.where(abs(x-(value)) == min(abs(x-(value))))].tolist() > nparam_density2[np.where(abs(x-(value)) == min(abs(x-(value))))].tolist():
        return 1
    else:
        return 2
Essentially, we're saying that if a single value is more probable in a distribution, it's probably from that distribution.
Example:
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
a = [5,4,8,3,6,4,7,2] # 1
b = [9,5,14,10,11,18,9] # 2
print(get_most_likely_distribution_membership(6,a,b))
print(get_most_likely_distribution_membership(10,a,b))
1 and 2, respectively.
Maybe something like this:
def isSimilar(a_list, member):
    m_count = a_list.count(member)
    return m_count / len(a_list)
Or perhaps using sets:
set(a_list).intersection( set(b_list))
which will return the intersection of the two lists; you could then count the resulting set and do some other maths.
Also consider using difflib if you are working with strings/sequences/etc: https://docs.python.org/2/library/difflib.html
Docs on list.count():
https://docs.python.org/2/tutorial/datastructures.html
So, I'm not exactly sure about the percentage thing. But figuring out which list the number is more likely to belong to shouldn't be too difficult. I would just calculate the average difference between the number and all the numbers in the list. The closer the average difference is to 0, the more likely it is to be in the list.
import math

def whichList(list1, list2, someNumber):
    if averageDifference(someNumber, list1) < averageDifference(someNumber, list2):
        print("list 1")
    else:
        print("list 2")

def averageDifference(someNumber, myList):
    total = 0
    for num in myList:
        total = total + math.fabs(num - someNumber)
    return total / len(myList)
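For example, with the lists from the question:
a = [5,4,8,3,6,4,7,2]
b = [9,5,14,10,11,18,9]
whichList(a, b, 5)   # prints "list 1", since 5 is on average much closer to the values in a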
Any 'percentage' would be subjective, but you could still use subjective numbers to rank. This approximates the lists as normal distributions and samples from them to see the likelihood of drawing the number (in a discrete bin around it).
import numpy as np
from scipy.stats import norm

def isSimilar(x, A, N=10000):
    M, S = np.mean(A), np.std(A)
    test = lambda: x - 0.5 <= norm.rvs(loc=M, scale=S) <= x + 0.5
    count = sum(test() for _ in range(N))
    return 1. * count / N

def most_similar(x, *args):
    scores = [(A, isSimilar(x, A)) for A in args]
    sorted_scores = sorted(scores, key=lambda pair: pair[1], reverse=True)
    return sorted_scores[0][0]

A = [4,5,5,6]
B = [1,2,8,9]
C = [5,4,8,3,6,4,7,2]
most_similar(5, A, B, C) # returns [4,5,5,6]
