Getting scipy's rv_discrete to work with floating point values?


I'm trying to define my own discrete distribution. The code I have works for integer values but not for decimal values. For example, this works:
>>> from scipy.stats import rv_discrete
>>> probabilities = [0.2, 0.5, 0.3]
>>> values = [1, 2, 3]
>>> distrib = rv_discrete(values=(values, probabilities))
>>> print(distrib.rvs(size=10))
[1 3 3 2 2 2 2 2 1 3]
But if I use decimal values, it doesn't work:
>>> from scipy.stats import rv_discrete
>>> probabilities = [0.2, 0.5, 0.3]
>>> values = [.1, .2, .3]
>>> distrib = rv_discrete(values=(values, probabilities))
>>> print(distrib.rvs(size=10))
[0 0 0 0 0 0 0 0 0 0]
Thanks..

Per stats.rv_discrete's doc string:
values : tuple of two array_like, optional
(xk, pk) where xk are integers with non-zero
probabilities pk with sum(pk) = 1.
(the emphasis on "integers" is mine). So the discrete distributions created by rv_discrete must use integer values. However, it is not hard to map those integers to floats: define the distribution over the integer indices 0, 1, 2, ... and use the values returned by rvs as indices into the values array:
In [4]: values = np.array([0.1, 0.2, 0.3])
In [5]: idx = distrib.rvs(size=10); idx
Out[5]: array([1, 1, 0, 0, 1, 1, 0, 2, 1, 1])
In [6]: values[idx]
Out[6]: array([ 0.2, 0.2, 0.1, 0.1, 0.2, 0.2, 0.1, 0.3, 0.2, 0.2])
Thus you could use:
import numpy as np
import scipy.stats as stats
np.random.seed(2016)
probabilities = np.array([0.2, 0.5, 0.3])
values = np.array([0.1, 0.2, 0.3])
distrib = stats.rv_discrete(values=(range(len(probabilities)), probabilities))
idx = distrib.rvs(size=10)
result = values[idx]
print(result)
# [ 0.3 0.3 0.3 0.3 0.2 0.2 0.2 0.3 0.3 0.2]
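If SciPy is not otherwise needed, the same kind of sampling can be done with NumPy alone. This is a swapped-in alternative, not the rv_discrete approach above: a minimal sketch assuming NumPy's Generator API is available (the name rng and the seed 0 are illustrative choices).

```python
import numpy as np

probabilities = np.array([0.2, 0.5, 0.3])
values = np.array([0.1, 0.2, 0.3])

# Generator.choice draws directly from `values` with the given weights,
# so no intermediate integer indices are needed.
rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility
samples = rng.choice(values, size=10, p=probabilities)
print(samples)
```

Every entry of samples is one of the floats in values, drawn with the corresponding probability.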

Related

Getting bincount of float values

I'm wondering if there is an easy function in numpy to get counts of values within a set of ranges. For example:
import numpy as np
rand_vals = np.random.rand(10)
#Out > array([0.15068161, 0.51291888, 0.99576726, 0.05944532, 0.72641707,
#             0.09693093, 0.61988549, 0.19811334, 0.88184011, 0.16775108])
bins = np.linspace(0,1,11)
#Out> array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ])
#Expected Out > [2, 3, 0, 0, 0, 1, 1, 1, 1, 1]
#The first entry is 2 since there are two values between 0 and 0.1 (0.05944532, 0.09693093)
#The second entry is 3 since there are 3 values between 0.1 and 0.2 (0.15068161, 0.19811334, 0.16775108)
#So on ..

You can use numpy.histogram():
import numpy as np
bincount, bins = np.histogram(rand_vals, bins=np.linspace(0,1,11))
print(bincount) # => [0 2 1 0 2 0 1 1 2 1] (from a different random sample than the one shown in the question)
print(bins) # => [0. 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1. ]
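To check the histogram approach against the numbers in the question, here is a small sketch that hard-codes the sample values quoted there (the names edges and counts are illustrative):

```python
import numpy as np

# Sample values copied from the question
rand_vals = np.array([0.15068161, 0.51291888, 0.99576726, 0.05944532, 0.72641707,
                      0.09693093, 0.61988549, 0.19811334, 0.88184011, 0.16775108])

edges = np.linspace(0, 1, 11)            # 10 bins of width 0.1
counts, _ = np.histogram(rand_vals, bins=edges)
print(counts.tolist())  # [2, 3, 0, 0, 0, 1, 1, 1, 1, 1]
```

Each bin is half-open, [left, right), except the last one, which also includes its right edge.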

How do i create a majority voting based in two arrays?

Scenario:
I want to create a majority vote system that takes into account the weight of each observer's vote on N observations.
So, M observers will give their guess about N observations, selecting from 3 classes (1,2,3). For each observation, each observer will have a weight associated with it.
Defining:
G: Matrix of guesses per observation / observer (N observations × M observers);
W: Weights for each observation / observer (N observations × M observers)
Example:
# 2 observations, 3 observers
G = [[1, 2, 3],
[2, 2, 1]]
# Weights (influence) each observer has about each observation
W = [[0.1, 0.2, 0.3],
[0.3, 0.1, 0.2]]
I need to compute another matrix with shape (N observations × C classes) that stores the probability that an observation comes from a specific class.
Example using values above:
G = [[1, 2, 3],
[2, 2, 1]]
W = [[0.1, 0.2, 0.3],
[0.3, 0.1, 0.2]]
P = [[0.1, 0.2, 0.3],
[0.2, (0.3 + 0.1), 0]]
After computing the P matrix, I could apply np.argmax() row-wise to get the column (class) with highest value:
P = [[0.1, 0.2, 0.3], #class 3 has highest value (0.3)
[0.2, 0.4, 0]] #class 2 has highest value (0.4)
result = [3, 2]
I would like to know how can I combine G and W to generate the P matrix.

You can get the job done in a vectorized manner by using NumPy's indices and advanced indexing:
In [569]: import numpy as np
In [570]: G = np.array([[1, 2, 3], [2, 2, 1]] )
In [571]: W = np.array([[0.1, 0.2, 0.3], [0.3, 0.1, 0.2]])
In [572]: C = 3
In [573]: M, N = G.shape
In [574]: row, col = np.indices((M, N))
In [575]: P3d = np.zeros(shape=(M, N, C))
In [576]: P3d[row, col, G-1] = W
In [577]: P = P3d.sum(axis=1)
In [578]: P
Out[578]:
array([[0.1, 0.2, 0.3],
[0.2, 0.4, 0. ]])
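The intermediate 3-D array can probably be avoided by accumulating straight into P. The following is a sketch of that variation using np.add.at, not the method shown above; n_classes and rows are names introduced here for illustration.

```python
import numpy as np

G = np.array([[1, 2, 3], [2, 2, 1]])
W = np.array([[0.1, 0.2, 0.3], [0.3, 0.1, 0.2]])
n_classes = 3  # illustrative name; same role as C above

P = np.zeros((G.shape[0], n_classes))
rows = np.arange(G.shape[0])[:, None]  # column vector, broadcasts against G
# np.add.at accumulates repeated (row, class) pairs instead of overwriting them
np.add.at(P, (rows, G - 1), W)

result = P.argmax(axis=1) + 1  # back to 1-based class labels
print(result.tolist())  # [3, 2]
```

A plain P[rows, G - 1] += W would not be enough here, because repeated (row, class) pairs are only applied once by buffered fancy-index assignment.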

Initialize P with zeros, then iterate over the rows (observations) of G. For each entry G[observation][observer], take the class it names and add the matching weight to that class's column: P[observation][class - 1] += W[observation][observer] (the classes are 1-based, the columns 0-based). In your sample, row 0 of G is [1, 2, 3] with weights [0.1, 0.2, 0.3]; each class appears once, so row 0 of P is [0.1, 0.2, 0.3].
In row 1, G is [2, 2, 1] with weights [0.3, 0.1, 0.2]. The first two observers both chose class 2, so its column accumulates 0.3 + 0.1 = 0.4; the last observer chose class 1, which gets 0.2; class 3 stays at 0. Once the matrix is filled, apply np.argmax() row-wise to get the required answer.
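The loop described above could look roughly like this. It is a plain-Python sketch on the question's sample data; the variable names obs, observer, and n_classes are made up for the example.

```python
import numpy as np

G = np.array([[1, 2, 3], [2, 2, 1]])
W = np.array([[0.1, 0.2, 0.3], [0.3, 0.1, 0.2]])
n_classes = 3  # illustrative name for the number of classes

P = np.zeros((G.shape[0], n_classes))
for obs in range(G.shape[0]):            # each observation (row)
    for observer in range(G.shape[1]):   # each observer's guess
        guessed_class = G[obs, observer]
        # classes are 1-based, columns of P are 0-based
        P[obs, guessed_class - 1] += W[obs, observer]

result = np.argmax(P, axis=1) + 1
print(result.tolist())  # [3, 2]
```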

Add a scalar to a numpy matrix based on the indices in a different numpy array

I'm sorry if this question isn't framed well. So I would rather explain with an example.
I have a numpy matrix:
a = np.array([[0.5, 0.8, 0.1], [0.6, 0.9, 0.3], [0.7, 0.4, 0.8], [0.8, 0.7, 0.6]])
And another numpy array as shown:
b = np.array([1, 0, 2, 2])
The values in b are guaranteed to be in range(a.shape[1]), and b.shape[0] == a.shape[0]. This is the operation I need to perform:
For every row index i of a and the corresponding entry b[i], I need to subtract 1 from a[i][j], where j == b[i].
So in my example, a[0] == [0.5 0.8 0.1] and b[0] == 1. Therefore I need to subtract 1 from a[0][b[0]] so that a[0] = [0.5, -0.2, 0.1]. This has to be done for all rows of a. Any direct solution without me having to iterate through all rows or columns one by one?
Thanks.

Use NumPy integer array indexing:
import numpy as np
a = np.array([[0.5, 0.8, 0.1], [0.6, 0.9, 0.3], [0.7, 0.4, 0.8], [0.8, 0.7, 0.6]])
b = np.array([1, 0, 2, 2])
a[np.arange(a.shape[0]), b] -= 1
print(a)
Output
[[ 0.5 -0.2 0.1]
[-0.4 0.9 0.3]
[ 0.7 0.4 -0.2]
[ 0.8 0.7 -0.4]]
As an alternative, use np.subtract.at (applied to the original a, not the already-modified one):
np.subtract.at(a, (np.arange(a.shape[0]), b), 1)
print(a)
Output
[[ 0.5 -0.2 0.1]
[-0.4 0.9 0.3]
[ 0.7 0.4 -0.2]
[ 0.8 0.7 -0.4]]
The main idea is that:
np.arange(a.shape[0]) # shape[0] is equal to the number of rows
generates the indices of the rows:
[0 1 2 3]
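The two approaches above agree here because every (row, column) pair is touched once. They differ when an index repeats, which the following sketch illustrates on a made-up 1-D array (counts and idx are not from the question):

```python
import numpy as np

counts = np.zeros(3)
idx = np.array([0, 0, 2])  # index 0 appears twice

buffered = counts.copy()
buffered[idx] -= 1                   # repeated index is applied only once

unbuffered = counts.copy()
np.subtract.at(unbuffered, idx, 1)   # repeated index is applied every time

print(buffered.tolist())    # [-1.0, 0.0, -1.0]
print(unbuffered.tolist())  # [-2.0, 0.0, -1.0]
```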

Accumulate partial reductions into array in numpy

Problem description
How do I accumulate into a the values in c using b to index into a? That is, given
import numpy as np
a = np.zeros(3)
b = np.array([2, 1, 0, 1])
c = np.arange(0.1, 0.5, 0.1)
print ('a=%s b=%s c=%s'.replace(' ', '\n') % (str(a), str(b), str(c)))
which outputs
a=[ 0. 0. 0.]
b=[2 1 0 1]
c=[ 0.1 0.2 0.3 0.4]
how do I achieve
d = np.array([0.3, 0.2 + 0.4, 0.1])
print('d=%s' % str(d))
which outputs
d=[ 0.3 0.6 0.1]
using a, b, and c without using a for loop?
My solution attempt
I can sort b and then sort c using the indices that sorted b
p = b.argsort()
print ('b[p]=%s c[p]=%s'.replace(' ', '\n') % (str(b[p]), str(c[p])))
which outputs
b[p]=[0 1 1 2]
c[p]=[ 0.3 0.2 0.4 0.1]
then reduce b to occurrence counts
occ = np.bincount(b[p])
print('occ=%s' % str(occ))
which outputs
occ=[1 2 1]
and use this to compute partial sums
print(np.array([np.sum(c[p][0:occ[0]]),
                np.sum(c[p][occ[0]:occ[0]+occ[1]]),
                np.sum(c[p][occ[0]+occ[1]:occ[0]+occ[1]+occ[2]])]))
which outputs
[ 0.3 0.6 0.1]
How do I generalize this?
All code and output
import numpy as np
a = np.zeros(3)
b = np.array([2, 1, 0, 1])
c = np.arange(0.1, 0.5, 0.1)
print ('a=%s b=%s c=%s'.replace(' ', '\n') % (str(a), str(b), str(c)))
d = np.array([0.3, 0.2 + 0.4, 0.1])
print('d=%s' % str(d))
p = b.argsort()
print ('b[p]=%s c[p]=%s'.replace(' ', '\n') % (str(b[p]), str(c[p])))
occ = np.bincount(b[p])
print('occ=%s' % str(occ))
print(np.array([np.sum(c[p][0:occ[0]]),
                np.sum(c[p][occ[0]:occ[0]+occ[1]]),
                np.sum(c[p][occ[0]+occ[1]:occ[0]+occ[1]+occ[2]])]))
which outputs
a=[ 0. 0. 0.]
b=[2 1 0 1]
c=[ 0.1 0.2 0.3 0.4]
d=[ 0.3 0.6 0.1]
b[p]=[0 1 1 2]
c[p]=[ 0.3 0.2 0.4 0.1]
occ=[1 2 1]
[ 0.3 0.6 0.1]

np.bincount does exactly what you want:
>>> import numpy as np
>>>
>>> b = [2, 1, 0, 1]
>>> c = np.arange(0.1, 0.5, 0.1)
>>> c
array([0.1, 0.2, 0.3, 0.4])
>>> np.bincount(b, c)
array([0.3, 0.6, 0.1])
There is also np.add.at, but unless the update to a is very sparse it is much slower.
>>> a = np.zeros(3)
>>> np.add.at(a, b, c)
>>> a
array([0.3, 0.6, 0.1])
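If a is longer than b.max() + 1, or already holds values that should be kept, the bincount result can be padded and added in place. This is a sketch under the assumption that every index in b is smaller than a.size; the length 5 is arbitrary.

```python
import numpy as np

a = np.zeros(5)                 # deliberately longer than b.max() + 1
b = np.array([2, 1, 0, 1])
c = np.array([0.1, 0.2, 0.3, 0.4])

# minlength pads the result with zeros so that it lines up with `a`
a += np.bincount(b, weights=c, minlength=a.size)
print(a)
```

The first three entries hold the accumulated sums and the remaining entries stay at zero.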

If you can use pandas, there is a one-line solution:
import pandas as pd
a = pd.DataFrame({'b':b,'c':c}).groupby('b')['c'].sum().reset_index()
Output:
b c
0 0.3
1 0.6
2 0.1
If you then need NumPy arrays, convert the necessary columns:
import pandas as pd
a = pd.DataFrame({'b':b,'c':c}).groupby('b')['c'].sum().reset_index()
b = np.array(a['b'])
c = np.array(a['c'])

python; counting elements of vectors

I would like to count and save in a vector a the number of elements of an array that are greater than a certain value t. I want to do this for different ts.
eg
My vector:c=[0.3 0.2 0.3 0.6 0.9 0.1 0.2 0.5 0.3 0.5 0.7 0.1]
I would like to count the number of elements of c that are greater than t=0.9, then t=0.8, then t=0.7, etc. I then want to save the counts for each value of t in a vector.
my code is (not working):
for t in range(0,10,1):
    for j in range(0, len(c)):
        if c[j]>t/10:
            a.append(sum(c[j]>t))
my vector a should be of dimension 10, but it isn't!
Anybody can help me out?

I made a function that loops over the array and counts every value that is greater than the supplied threshold:
c=[0.3, 0.2, 0.3, 0.6, 0.9, 0.1, 0.2, 0.5, 0.3, 0.5, 0.7, 0.1]
def num_bigger(threshold):
    count = 0
    for num in c:
        if num > threshold:
            count +=1
    return count
thresholds = [x/10.0 for x in range(10)]
for thresh in thresholds:
    print(thresh, num_bigger(thresh))
Note that the function checks for strictly greater, which is why, for example, the result is 0 when the threshold is .9.

There are a few things wrong with your code.
my vector a should be of dimension 10, but it isn't!
That's because you append more than 10 elements to your list. Look at your logic.
for t in range(0,10,1):
    for j in range(0, len(c)):
        if c[j]>t/10:
            a.append(sum(c[j]>t))
For each threshold, t, you iterate over all 12 items in c one at a time and you append something to the list. Overall, you get 120 items. What you should have been doing instead is (in pseudocode):
for each threshold:
    count = how many elements in c are greater than threshold
    a.append(count)
numpy.where() gives you the indices in an array where a condition is satisfied, so you just have to count how many indices you get each time. We'll get to the full solution in a moment.
Another potential error is t/10, which in Python 2 is integer division and returns 0 for every threshold. The correct way is to force float division with t/10.0. On Python 3, / is float division by default, so this is not a problem there. Notice, though, that you also compare c[j] > t inside sum(), where t runs from 0 to 9 rather than from 0.0 to 0.9. Overall, your c[j] > t logic is wrong. You want to use a counter over all elements, as other answers have shown, or collapse it all down to a one-liner list comprehension.
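To make the division point concrete, here is a tiny illustration. In Python 3 the // operator reproduces what Python 2 computes for t/10 with two integers; the variable names are made up for the example.

```python
t = 3

floor_quotient = t // 10   # what Python 2 computes for the int expression t/10
true_quotient = t / 10.0   # a float operand forces true division in both versions

print(floor_quotient, true_quotient)  # 0 0.3
```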
Finally, here's a solution fully utilising numpy.
import numpy as np
c = np.array([0.3, 0.2, 0.3, 0.6, 0.9, 0.1, 0.2, 0.5, 0.3, 0.5, 0.7, 0.1])
thresh = np.arange(0, 1, 0.1)
counts = np.empty(thresh.shape, dtype=int)
for i, t in enumerate(thresh):
    counts[i] = len(np.where(c > t)[0])
print(counts)
Output:
[12 10 8 5 5 3 2 1 1 0]
Letting numpy take care of the loops under the hood is faster than Python-level loops. For demonstration:
import timeit
head = """
import numpy as np
c = np.array([0.3, 0.2, 0.3, 0.6, 0.9, 0.1, 0.2, 0.5, 0.3, 0.5, 0.7, 0.1])
thresh = np.arange(0, 1, 0.1)
"""
numpy_where = """
for t in thresh:
    len(np.where(c > t)[0])
"""
python_loop = """
for t in thresh:
    len([element for element in c if element > t])
"""
n = 10000
for test in [numpy_where, python_loop]:
    print(timeit.timeit(test, setup=head, number=n))
Which on my computer results in the following timings.
0.231292377372
0.321743753994
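The remaining Python-level loop over the thresholds can also be handed to NumPy with broadcasting. This is a sketch of that variation, not part of the timing above; it builds a boolean table with one entry per (element, threshold) pair, so it trades memory for speed.

```python
import numpy as np

c = np.array([0.3, 0.2, 0.3, 0.6, 0.9, 0.1, 0.2, 0.5, 0.3, 0.5, 0.7, 0.1])
thresh = np.arange(10) / 10.0  # 0.0, 0.1, ..., 0.9

# c[:, None] has shape (12, 1) and thresh has shape (10,), so the
# comparison broadcasts to a (12, 10) boolean table; summing over
# axis 0 counts the True entries for each threshold.
counts = (c[:, None] > thresh).sum(axis=0)
print(counts.tolist())  # [12, 10, 8, 5, 5, 3, 2, 1, 1, 0]
```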

Your problem is here:
if c[j]>t/10:
Notice that both t and 10 are integers, so in Python 2 you perform integer division.
The easiest solution with the least changes is to change it to:
if c[j]>float(t)/10:
to force float division
So the whole code would look something like this:
a = []
c = [0.3, 0.2, 0.3, 0.6, 0.9, 0.1, 0.2, 0.5, 0.3, 0.5, 0.7, 0.1]
for i in range(10): # cutoffs 0.0 through 0.9; use range(11) to include 1.0
    count = 0
    cutoff = float(i)/10
    for ele in c:
        if ele > cutoff:
            count += 1
    a.append(count)
print(len(a)) # prints 10, one count per cutoff from 0.0 to 0.9
print(a) # prints the counts for cutoffs 0.0 through 0.9

You have to divide by 10.0 (t / 10.0) so that the result is a float; in Python 2, t / 10 is an integer.
a = []
c=[0.3, 0.2, 0.3, 0.6, 0.9, 0.1, 0.2, 0.5, 0.3, 0.5, 0.7, 0.1]
for t in range(0,10,1):
    count = 0
    for j in range(0, len(c)):
        if c[j]>t/10.0:
            count = count+1
    a.append(count)
for t in range(0,10,1):
    print(str(a[t]) + ' elements in c are bigger than ' + str(t/10.0))
Output:
12 elements in c are bigger than 0.0
10 elements in c are bigger than 0.1
8 elements in c are bigger than 0.2
5 elements in c are bigger than 0.3
5 elements in c are bigger than 0.4
3 elements in c are bigger than 0.5
2 elements in c are bigger than 0.6
1 elements in c are bigger than 0.7
1 elements in c are bigger than 0.8
0 elements in c are bigger than 0.9

If you simplify your code, bugs won't have places to hide!
c=[0.3, 0.2, 0.3, 0.6, 0.9, 0.1, 0.2, 0.5, 0.3, 0.5, 0.7, 0.1]
a=[]
for t in [x/10.0 for x in range(10)]:
    a.append((t,len([x for x in c if x>t])))
a
[(0.0, 12),
(0.1, 10),
(0.2, 8),
(0.3, 5),
(0.4, 5),
(0.5, 3),
(0.6, 2),
(0.7, 1),
(0.8, 1),
(0.9, 0)]
or even this one-liner
[(r/10.0,len([x for x in c if x>r/10.0])) for r in range(10)]

It depends on the sizes of your arrays, but your current solution has O(m*n) complexity, m being the number of values to test and n the size of your array. You may be better off with O((m+n)*log(n)) by first sorting your array in O(n*log(n)) and then using binary search to find the m values in O(m*log(n)). Using numpy and your sample c list, this would be something like:
>>> c
[0.3, 0.2, 0.3, 0.6, 0.9, 0.1, 0.2, 0.5, 0.3, 0.5, 0.7, 0.1]
>>> thresholds = np.linspace(0, 1, 10, endpoint=False)
>>> thresholds
array([ 0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
>>> len(c) - np.sort(c).searchsorted(thresholds, side='right')
array([12, 10, 8, 5, 5, 3, 2, 1, 1, 0])
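As a self-contained version of the sort-and-search idea, the sketch below also cross-checks the result against a brute-force count. The names sorted_c, n_not_greater, and expected are introduced here for illustration.

```python
import numpy as np

c = np.array([0.3, 0.2, 0.3, 0.6, 0.9, 0.1, 0.2, 0.5, 0.3, 0.5, 0.7, 0.1])
thresholds = np.arange(10) / 10.0  # 0.0, 0.1, ..., 0.9

sorted_c = np.sort(c)  # O(n log n), done once
# side='right' gives, for each threshold, the position just past any
# elements equal to it, i.e. the number of elements <= threshold.
n_not_greater = np.searchsorted(sorted_c, thresholds, side='right')
counts = c.size - n_not_greater

# Brute-force O(m*n) reference, used only to check the result
expected = np.array([(c > t).sum() for t in thresholds])
print(counts.tolist())  # [12, 10, 8, 5, 5, 3, 2, 1, 1, 0]
```

With side='left' the elements equal to a threshold would be counted as greater, which is not what the question asks for.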
