Calculating Expected Value With Matrix Values - python

I have the following input data
class_p = [0.0234375, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.1748046875, 0.0439453125, 0.0, 0.35302734375, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.3828125]
league_p = [0.4765625, 0.0, 0.00634765625, 0.4658203125, 0.0, 0.0, 0.046875, 0.0, 0.0, 0.0029296875, 0.0, 0.0, 0.0, 0.0, 0.0]
a2_p = [0.1171875, 0.0, 0.0, 0.1171875, 0.0, 0.0078125, 0.30322265625, 0.31103515625, 0.0, 0.0, 0.0, 0.1435546875, 0.0, 0.0, 0.0]
p1_p = [0.0, 0.03125, 0.375, 0.09375, 0.0234375, 0.0, 0.46875, 0.0078125, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
p2_p = [0.3984375, 0.0, 0.0, 0.3828125, 0.08935546875, 0.08935546875, 0.023345947265625, 0.007720947265625, 0.0, 0.0, 0.0087890625, 0.00018310546875, 0.0, 0.0, 0.0]
class_v = [55, 75, 55, 75, 500, 10000, 55, 55, 55, 75, 75, 55, 55, 500, 55, 55, 75, 75, 55, 55, 55]
league_v = [0, 0, 0, 0, 0, 0, 0, 0, 40, 40, 40, 40, 1500, 1500, 3000]
a2_v= [0, 0, 0, 0, 0, 0, 0, 0, 40, 40, 40, 40, 1500, 1500, 3000]
p1_v = [0, 0, 0, 0, 0, 0, 0, 40, 40, 40, 40, 40, 1500, 1500, 3000]
p2_v = [0, 0, 0, 0, 0, 0, 0, 0, 40, 40, 40, 40, 1500, 1500, 3000]
With that data, I am generating the odds of each combination occurring.
As an example, to generate the chance of a given combination
class_p[0]
league_p[6]
a2_p[11]
p1_p[7]
p2_p[3]
I would multiply their values together
0.0234375 × 0.046875 × 0.1435546875 × 0.0078125 × 0.3828125
That would give me 4.716785042546689510345458984375 × 10^-7
Since the given combination had class_p[0], league_p[6], a2_p[11], p1_p[7], p2_p[3], I would take the following values in the "values" arrays.
I would sum
class_v[0] + league_v[6] + a2_v[11] + p1_v[7] + p2_v[3]
That would give me 55+0+40+40+0 = 135
To finalize the process I would do
(0.0234375*0.046875*0.1435546875*0.0078125*0.3828125)*(55+0+40+40+0) = 0.00006367659807
The full final calculation is
(0.0234375 × 0.046875 × 0.1435546875 × 0.0078125 × 0.3828125) × (55 + 0 + 40 + 40 + 0)
(combination_chance) * (combination_value)
I need to do this process for all possible combinations of combination_chance.
This should give me a column of values (one per combination). If I sum the values of that column I reach the overall EV, by summing the EV of the individual combinations.
Calculating combination_chance works just fine. My issue is how to line up a given combination with its corresponding value sum (combination_value). At the moment I have additional identifiers attached to the *_p arrays and do a string comparison on them to determine which combination value to use. That is very slow for billions of comparisons, so I am exploring a better approach.
I am using Python 3.8 and numpy 1.24.
Edit
The question has been adjusted to include much more detail

Broadcasting
Ok, so it seems that this is a simple broadcasting problem.
You want a 5D array of probabilities times a 5D array of values. And, of course, you want it without any for loop.
In numpy, the classical way to have numpy do the nested loops for you (which is indeed far faster than doing them yourself; the first rule of numpy is to avoid iterating over elements at all costs, so no for loops) is to use broadcasting.
Let's start with a 2D example (as was your first intention, and that was a good idea; the only problem was that it was ambiguous, so restricting the question to 2D was not a bad move).
You have
class_p = np.array([0.0234375, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.1748046875, 0.0439453125, 0.0, 0.35302734375, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.3828125])
league_p = np.array([0.4765625, 0.0, 0.00634765625, 0.4658203125, 0.0, 0.0, 0.046875, 0.0, 0.0, 0.0029296875, 0.0, 0.0, 0.0, 0.0, 0.0])
One way (not the only one, but probably the one easier to adapt to any similar question) is to use broadcasting.
If you convert class_p into a column, that is a 21×1 2D array, and league_p into a row, that is a 1×15 2D array, then multiplying the two gives a 21×15 2D array containing all combinations.
Because
np.array([[1],[2],[3]]) * np.array([[4,5]])
is
[[ 4,  5],
 [ 8, 10],
 [12, 15]]
That's how broadcasting works.
There are several ways to convert a 1D array into a row or a column of a 2D array. For example you could use .reshape, like class_p.reshape(-1,1) and league_p.reshape(1,-1). But the fastest is to add a new axis, like class_p[:,None] and league_p[None,:]. Note that the second way doesn't really create a new array; it is just a different view of the same data, which is why it is faster.
So, our 2D probability map is
class_p[:,None]*league_p[None,:]
Likewise, to get all 21×15 combinations of summed values, you can rely on the same broadcasting to perform the addition (assuming class_v and league_v have also been converted to numpy arrays)
class_v[:,None]+league_v[None,:]
Broadcasting solution
So the solution in 2D, using broadcasting, is
class_p[:,None]*league_p[None,:] * (class_v[:,None] + league_v[None,:])
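For instance, a minimal sketch with small toy arrays (cp, cv, lp, lv are stand-ins, not the question's data), just to show the shapes involved:
import numpy as np

# toy data, not the arrays from the question
cp = np.array([0.2, 0.8])
cv = np.array([55, 75])
lp = np.array([0.5, 0.3, 0.2])
lv = np.array([0, 40, 1500])

pr = cp[:, None] * lp[None, :]   # (2, 3) combination probabilities
vl = cv[:, None] + lv[None, :]   # (2, 3) summed combination values
print(pr * vl)                   # per-combination contribution to the expected value
print((pr * vl).sum())           # expected value over these two variables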
In 5D, with all your variables (assuming every *_p and *_v list has been converted to a numpy array), it is still manageable. But don't add too many dimensions: the result quickly becomes huge, and I suspect that what you are really interested in at the end is just the sum of all of it. This time it is not written in one line; not that it couldn't be, but that would be a very long line.
pr = class_p[:,None,None,None,None]*league_p[None,:,None,None,None]*a2_p[None,None,:,None,None]*p1_p[None,None,None,:,None]*p2_p[None,None,None,None,:]
vl = class_v[:,None,None,None,None]+league_v[None,:,None,None,None]+a2_v[None,None,:,None,None]+p1_v[None,None,None,:,None]+p2_v[None,None,None,None,:]
pr*vl
add.outer and multiply.outer
As you can see, in 5D it is a little tedious. But I wanted to show you the principle of broadcasting before introducing another way, not really shorter but a bit less tedious. That way was already given by Reinderien; since it was posted before you clarified the question it did not give the right result, but the principle is the same.
In 2D
np.multiply.outer(class_p, league_p) * np.add.outer(class_v, league_v)
Unfortunately, those functions take only 2 arguments. So in 5D, you have to chain them:
pr = np.multiply.outer(class_p, np.multiply.outer(league_p, np.multiply.outer(a2_p, np.multiply.outer(p1_p, p2_p))))
vl = np.add.outer(class_v, np.add.outer(league_v, np.add.outer(a2_v, np.add.outer(p1_v, p2_v))))
pr * vl
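If chaining by hand feels tedious, the same thing can be written with functools.reduce (just a sketch; it assumes the five *_p and *_v lists from the question have already been converted to numpy arrays):
from functools import reduce
import numpy as np

probs  = [class_p, league_p, a2_p, p1_p, p2_p]
values = [class_v, league_v, a2_v, p1_v, p2_v]

pr = reduce(np.multiply.outer, probs)   # 5D array of combination probabilities
vl = reduce(np.add.outer, values)       # 5D array of summed combination values
ev = (pr * vl).sum()                    # overall expected value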
Expected value
Note that if the aim of all this is to compute the expected "value" (whatever that value is), that is Σ p(i,j,k,l,m)·v(i,j,k,l,m) over all possible outcomes, then doing it that way is probably not a good idea.
For your example it is manageable. You are computing "only" about 1 million possible outcomes, that is 1 million probabilities (each of them 4 multiplications) and 1 million associated values (4 additions each), then performing 1 million multiplications between those two sets, and finally summing the result, which is another million additions. Altogether that is only about 10 million elementary arithmetic operations. Not much for a modern computer, and the response still feels instantaneous. But it is O(Nᵏ) in both CPU and memory, N being the typical length of an array and k the number of variables.
But if you intend to add more dimensions (more variables, each with its own set of probabilities and values), this becomes needlessly explosive in both CPU time and memory (those 5D arrays of probabilities and values have to be stored). And even as it stands, if you intend to perform this computation more than once, the expected value can be computed far faster, using only O(N·k) operations.
I spare you the derivation (it is just a matter of expanding the sum Σᵢⱼₖₗₘ pᵢpⱼpₖpₗpₘ (vᵢ+vⱼ+vₖ+vₗ+vₘ)); you can compute it faster like this
P1 = class_p.sum()
PV1 = (class_p*class_v).sum()
P2 = league_p.sum()
PV2 = (league_p*league_v).sum()
P3 = a2_p.sum()
PV3 = (a2_p*a2_v).sum()
P4 = p1_p.sum()
PV4 = (p1_p*p1_v).sum()
P5 = p2_p.sum()
PV5 = (p2_p*p2_v).sum()
expectedValue = P1*P2*P3*P4*PV5 + P1*P2*P3*PV4*P5 + P1*P2*PV3*P4*P5 + P1*PV2*P3*P4*P5 + PV1*P2*P3*P4*P5
sameAs = (pr*vl).sum()
It looks more complicated because there are more lines, but each line works along one dimension only. So it replaces on the order of n₁n₂n₃n₄n₅ operations with on the order of n₁+n₂+n₃+n₄+n₅ operations, where n₁,…,n₅ are the sizes of the arrays of the 5 variables.
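The same factoring can be written once for any number of (probability, value) pairs; a hypothetical helper (not from the question), just to illustrate the O(n₁+…+n₅) pattern:
import numpy as np

def expected_value(pairs):
    # pairs is a list of (probabilities, values) 1D arrays, one pair per variable
    Ps  = [np.asarray(p).sum() for p, _ in pairs]
    PVs = [(np.asarray(p) * np.asarray(v)).sum() for p, v in pairs]
    total = 0.0
    for i in range(len(pairs)):      # term i: PV_i times the product of the other P_j
        term = PVs[i]
        for j in range(len(pairs)):
            if j != i:
                term *= Ps[j]
        total += term
    return total

# e.g. expected_value([(class_p, class_v), (league_p, league_v), (a2_p, a2_v), (p1_p, p1_v), (p2_p, p2_v)])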
So, again, if your objective is to compute the expected value, then building the full 5D arrays (as your question asks) is a really costly way to do it.

This doesn't make any attempt to cache intermediate results, etc.
import numpy as np
class_percentages = (0.0, 0.0, 0.0, 0.3, 0.50)
league_percentages = (0.1, 0.0, 0.2, 0.1, 0.05)
class_values = (50, 50, 50, 75, 100)
league_values = (0, 10, 10, 25, 75)
combined = np.add.outer(class_percentages, league_percentages)*np.add.outer(class_values, league_values)
print(combined)
Output:
[[ 5.    0.   12.    7.5   6.25]
 [ 5.    0.   12.    7.5   6.25]
 [ 5.    0.   12.    7.5   6.25]
 [30.   25.5  42.5  40.   52.5 ]
 [60.   55.   77.   75.   96.25]]

Related

How to get K max values from a histogram?

I want to extract, let's say, the 3 max values in a matplotlib histogram.
There are a lot of ways to extract the (unique) max value in a histogram, but I can't find anything about extracting the 2, 3 or 4 highest values.
I also want it to be automatic (not specific to the following case).
Here is my data and my code:
from matplotlib.pyplot import *
Angle=[0.0, 0.0, 0.0, 0.0, 1.5526165117219184, 0.0, 1.559560844536934, 0.0, 1.5554129250143014, 1.5529410816553442, 1.5458015331759765, -0.036680787756651845, 0.0, 0.0, 0.0, 0.0, -0.017855245139552514, -0.03224688243525392, 1.5422326689561365, 0.595918005516301, -0.06731387579270513, -0.011627382956383872, 1.5515679276951895, -0.06413211500143158, 0.0, -0.6123221322275954, 0.0, 0.0, 0.13863973713415806, 0.07677189126977804, -0.021735706841792667, 0.0, -0.6099169030770674, 1.546410917622178, 0.0, 0.0, -0.24111767845146836, 0.5961991412974801, 0.014704822377851432]
figure(1,figsize=(16,10))
plt.hist(Angle, bins=100,label='Angle')
show()
plt.hist returns the bin heights, the bin boundaries and the rectangular patches.
np.argsort can sort the bin heights, and the result can be used to index the other arrays.
The code below imports pyplot as plt, because importing it as * can lead to a lot of confusion.
import matplotlib.pyplot as plt
import numpy as np
Angle=[0.0, 0.0, 0.0, 0.0, 1.5526165117219184, 0.0, 1.559560844536934, 0.0, 1.5554129250143014, 1.5529410816553442, 1.5458015331759765, -0.036680787756651845, 0.0, 0.0, 0.0, 0.0, -0.017855245139552514, -0.03224688243525392, 1.5422326689561365, 0.595918005516301, -0.06731387579270513, -0.011627382956383872, 1.5515679276951895, -0.06413211500143158, 0.0, -0.6123221322275954, 0.0, 0.0, 0.13863973713415806, 0.07677189126977804, -0.021735706841792667, 0.0, -0.6099169030770674, 1.546410917622178, 0.0, 0.0, -0.24111767845146836, 0.5961991412974801, 0.014704822377851432]
plt.figure(1,figsize=(10, 6))
values, bins, patches = plt.hist(Angle, bins=30)
order = np.argsort(values)[::-1]
print("4 highest bins:", values[order][:4])
print(" their ranges:", [ (bins[i], bins[i+1]) for i in order[:4]])
for i in order[:4]:
    patches[i].set_color('fuchsia')
plt.show()
Output:
4 highest bins: [21. 8. 3. 2.]
their ranges: [(-0.03315333842372081, 0.03924276080176348), (1.4871647453114498, 1.559560844536934), (-0.1055494376492051, -0.03315333842372081), (0.5460154553801537, 0.6184115546056381)]
Another example highlighting the 3 highest bins:
Angle = np.random.normal(np.tile(np.random.uniform(1, 100, 20 ), 100), 5 )
values, bins, patches = plt.hist(Angle, bins=100)
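Presumably followed by the same highlighting pattern as above, for example:
order = np.argsort(values)[::-1]
for i in order[:3]:
    patches[i].set_color('fuchsia')
plt.show()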

Feature matching with flann in opencv

I am working on an image search project for which I have defined/extracted the key point features using my own algorithm. Initially I extracted only a single feature and tried to match it using cv2.FlannBasedMatcher(), which worked fine; I implemented it as below.
Here vec is a 2D list of float values, each row of shape (10,).
Ex:
[[0.80000000000000004, 0.69999999999999996, 0.59999999999999998, 0.44444444444444448, 0.25, 0.0, 0.5, 2.0, 0, 2.9999999999999996]
[2.25, 2.666666666666667, 3.4999999999999996, 0, 2.5, 1.0, 0.5, 0.37499999999999994, 0.20000000000000001, 0.10000000000000001]
[2.25, 2.666666666666667, 3.4999999999999996, 0, 2.5, 1.0, 0.5, 0.37499999999999994, 0.20000000000000001, 0.10000000000000001]
[2.25, 2.666666666666667, 3.4999999999999996, 0, 2.5, 1.0, 0.5, 0.37499999999999994, 0.20000000000000001, 0.10000000000000001]]
import cv2
import numpy as np

vec1 = extractFeature(img1)
vec2 = extractFeature(img2)
q1 = np.asarray(vec1, dtype=np.float32)
q2 = np.asarray(vec2, dtype=np.float32)
FLANN_INDEX_KDTREE = 0
index_params = dict(algorithm = FLANN_INDEX_KDTREE, trees = 5)
search_params = dict(checks=50) # or pass empty dictionary
flann = cv2.FlannBasedMatcher(index_params,search_params)
matches = flann.knnMatch(q1,q2,k=2)
But now I have one more feature descriptor for each key point, alongside the previous one, but of a different length.
So now my feature descriptor looks like this:
[[[0.80000000000000004, 0.69999999999999996, 0.59999999999999998, 0.44444444444444448, 0.25, 0.0, 0.5, 2.0, 0, 2.9999999999999996],[2.06471330e-01, 1.59191645e-02, 9.17678759e-05, 1.32570314e-05, 4.58424252e-10, 1.66717250e-06,6.04810165e-11]
[[2.25, 2.666666666666667, 3.4999999999999996, 0, 2.5, 1.0, 0.5, 0.37499999999999994, 0.20000000000000001, 0.10000000000000001],[ 2.06471330e-01, 1.59191645e-02, 9.17678759e-05, 1.32570314e-05, 4.58424252e-10, 1.66717250e-06, 6.04810165e-11],
[[2.25, 2.666666666666667, 3.4999999999999996, 0, 2.5, 1.0, 0.5, 0.37499999999999994, 0.20000000000000001, 0.10000000000000001],[ 2.06471330e-01, 1.59191645e-02, 9.17678759e-05, 1.32570314e-05, 4.58424252e-10, 1.66717250e-06, 6.04810165e-11],
[[2.25, 2.666666666666667, 3.4999999999999996, 0, 2.5, 1.0, 0.5, 0.37499999999999994, 0.20000000000000001, 0.10000000000000001],[ 2.06471330e-01, 1.59191645e-02, 9.17678759e-05, 1.32570314e-05, 4.58424252e-10, 1.66717250e-06, 6.04810165e-11]]
Now each point's feature descriptor is a list of two lists (descriptors) of different lengths, (10,) and (7,), so I am getting the error
setting an array element with a sequence.
while converting the feature descriptor to a numpy array of float datatype:
q1 = np.asarray(vec1, dtype=np.float32)
I understand the reason for this error is the different lengths of the lists, so I wonder what would be the right way to implement the same?
You should define a single descriptor of size 10+7=17.
This way the descriptor space is of dimension 17 and you should be able to use cv2.FlannBasedMatcher.
Either create a global descriptor of the correct size, desc_glob = np.zeros((nb_pts,17)), and fill it manually, or find a Python way to do it. Maybe np.reshape((nb_pts,17))?
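For example, a minimal sketch (toy data, hypothetical helper name) of how each keypoint's two descriptors could be concatenated into one 17-dimensional row before matching:
import numpy as np

# toy stand-ins for the per-keypoint descriptor pairs: (length 10, length 7)
vec1 = [([0.1] * 10, [0.2] * 7), ([0.3] * 10, [0.4] * 7)]

def to_global_descriptors(vec):
    # concatenate each keypoint's two descriptors into a single 17-dimensional row
    return np.asarray([np.concatenate([a, b]) for a, b in vec], dtype=np.float32)

q1 = to_global_descriptors(vec1)
print(q1.shape)   # (2, 17)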
Edit:
To avoid favoring one descriptor type over the other, you need to weight or normalize the descriptors. This is the same principle as computing a global descriptor distance from two descriptors:
dist(desc1, desc2) = dist(desc1a, desc2a) + lambda * dist(desc1b, desc2b)
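As a sketch of the weighting idea (lam is a free parameter to tune; the array here is a random stand-in for the concatenated descriptors): under a squared-L2 distance, scaling the second block by sqrt(lam) makes the total distance equal dist_a + lam * dist_b.
import numpy as np

lam = 0.5                                       # hypothetical weight for the second descriptor type
q1 = np.random.rand(5, 17).astype(np.float32)   # stand-in for concatenated (10+7) descriptors
q1_weighted = np.hstack([q1[:, :10], np.sqrt(lam) * q1[:, 10:]]).astype(np.float32)
# squared L2 distances computed on q1_weighted are now dist_a + lam * dist_b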

Similarity Measure/Matrix for data (recommender system)- Python

I am new to machine learning and am trying out the following problem.
The input is 2 arrays of descriptions of the same length, and the output is an array of similarity scores: the first string of the first array compared to the first string of the second array, and so on.
Each item in the (numpy) arrays is a string of description. Can you write a function that finds out how similar two strings are by counting how many identical and co-occurring word IDs there are, and assigns a score (one possible weighting could be based on the frequency of co-occurrence versus the sum of the frequencies of the individual word IDs)? Then apply the function to the two arrays to get an array of scores.
Please also let me know if there are other approaches you would consider as well.
Thanks!
Data:
array(['0/1/2/3/4/5/6/5/7/8/9/3/10', '11/12/13/14/15/15/16/17/12',
'18/19/20/21/22/23/24/25',
'26/27/28/29/30/31/32/33/34/35/36/37/38/39/33/34/40/41',
'5/42/43/15/44/45/46/47/48/26/49/50/51/52/49/53/54/51/55/56/22',
'57/58/59/60/61/49/62/23/57/58/63/57/58', '64/65/66/63/67/68/69',
'70/71/72/73/74/75/76/77',
'78/79/80/81/82/83/84/85/86/87/88/89/90/91',
'33/34/92/93/94/95/85/96/97/98/99/60/85/86/100/101/102/103',
'104/105/106/107/108/109/110/86/107/111/112/113/5/114/110/113/115/116',
'117/118/119/120/121/12/122/123/124/125',
'14/126/127/128/122/129/130/131/132/29/54/29/129/49/3/133/134/135/136',
'137/138/139/140/141/142',
'143/144/145/146/147/148/149/150/151/152/4/153/154/155/156/157/158/128/159',
'160/161/162/163/131/2/164/165/166/167/168/169/49/170/109/171',
'172/173/174/175/176/177/73/178/104/179/180/179/181/173',
'182/144/183/179/73',
'184/163/68/185/163/8/186/187/188/54/189/190/191',
'181/192/0/1/193/194/22/195',
'113/196/197/198/68/199/68/200/201/202/203/201',
'204/205/206/207/208/209/68/200',
'163/210/211/122/212/213/214/215/216/217/100/101/160/139/218/76/179/219',
'220/221/222/223/5/68/224/225/54/225/226/227/5/221/222/223',
'214/228/5/6/5/215/228/228/229',
'230/231/232/233/122/215/128/214/128/234/234',
'235/236/191/237/92/93/238/239',
'13/14/44/44/240/241/242/49/54/243/244/245/55/56',
'220/21/246/38/247/201/248/73/160/249/250/203/201',
'214/49/251/252/253/254/255/256/257/258'],
dtype='|S127')
array(['151/308/309/310/311/215/312/160/313/214/49/12',
'314/315/316/275/317/42/318/319/320/212/49/170/179/29/54/29/321/322/323',
'324/325/62/220/326/194/327/328/218/76/241/329',
'330/29/22/103/331/314/68/80/49',
'78/332/85/96/97/227/333/4/334/188',
'57/335/336/34/187/337/21/338/212/213/339/340',
'341/342/167/343/8/254/154/61/344',
'2/292/345/346/42/347/348/348/100/349/202/161/263',
'283/39/312/350/26/351', '352/353/33/34/144/218/73/354/355',
'137/356/357/358/357/359/22/73/170/87/88/78/123/360/361/53/362',
'23/363/10/364/289/68/123/354/355',
'188/28/365/149/366/98/367/368/369/370/371/372/368',
'373/155/33/34/374/25/113/73', '104/375/81/82/168/169/81/82/18/19',
'179/376/377/378/179/87/88/379/20',
'380/85/381/333/382/215/128/383/384', '385/129/386/387/388',
'389/280/26/27/390/391/302/392/393/165/394/254/302/214/217/395/396',
'397/398/291/140/399/211/158/27/400', '401/402/92/93/68/80',
'77/129/183/265/403/404/405/406/60/407/162/408/409/410/411/412/413/156',
'129/295/90/259/38/39/119/414/415/416/14/318/417/418',
'419/420/421/422/423/23/424/241/421/425/58',
'426/244/427/5/428/49/76/429/430/431',
'257/432/433/167/100/101/434/435/436', '437/167/438/344/356/170',
'439/440/441/442/192/443/68/80/444/445/111', '446/312/23/447/448',
'385/129/218/449/450/451/22/452/125/129/453/212/128/454/455/456/457/377'],
dtype='|S127')
The following code should do what you need, in Python 3.x:
import numpy as np
from collections import Counter

def jaccardSim(c1, c2):
    # multiset Jaccard: & keeps the minimum counts, | keeps the maximum counts
    cU = c1 | c2
    cI = c1 & c2
    sim = sum(cI.values()) / sum(cU.values())
    return sim

def byteArraySim(b1, b2):
    cA = [Counter(b1[i].decode(encoding="utf-8", errors="strict").split("/"))
          for i in range(len(b1))]
    cB = [Counter(b2[i].decode(encoding="utf-8", errors="strict").split("/"))
          for i in range(len(b2))]
    # Assuming both arrays have the same length
    cSim = [jaccardSim(cA[i], cB[i]) for i in range(len(b1))]
    return cSim  # Array of similarities
The Jaccard similarity score is used in this implementation. You may use other scores, such as cosine or Hamming, to your liking.
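For instance, a cosine variant on the same Counters could look like this (just a sketch, one of many possible weightings):
import math
from collections import Counter

def cosineSim(c1, c2):
    # treat each Counter as a sparse frequency vector over word IDs
    dot = sum(c1[k] * c2[k] for k in set(c1) & set(c2))
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

print(cosineSim(Counter("0/1/2".split("/")), Counter("1/2/3".split("/"))))   # 0.666...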
Assuming that the arrays are stored in variables a and b, calling byteArraySim(a, b) outputs the following similarity scores:
[0.0,
0.0,
0.0,
0.038461538461538464,
0.0,
0.041666666666666664,
0.0,
0.0,
0.0,
0.08,
0.0,
0.05555555555555555,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.058823529411764705,
0.0,
0.0,
0.0,
0.05555555555555555,
0.0,
0.0,
0.0,
0.0,
0.0]

Floating point problems in asymptotic functions approaching zero - Python

New to Python, coming from MATLAB.
I am using a hyperbolic tangent truncation of a magnitude-scale function.
I encounter my problem when applying the 0.5 * math.tanh(r/rE - r0) + 0.5 function to an array of range values r = np.arange(0.1, 100.01, 0.01). I get several 0.0 values for the function on the side approaching zero, which cause domain issues when I take the logarithm:
P1 = [ (0.5*m.tanh(x / rE + r0 ) + 0.5) for x in r] # truncation function
I use this work-around:
P1 = [ -m.log10(x) if x!=0.0 else np.inf for x in P1 ]
which is sufficient for what I am doing but is a bit of a band-aid solution.
As requested for mathematical explicitness:
In astronomy, the magnitude scale works roughly as such:
mu = -2.5*log(flux) + mzp # apparent magnitude
where mzp is the magnitude at which one would see 1 photon per second. Therefore greater fluxes equate to smaller (more negative) apparent magnitudes. I am making models for sources that use multiple component functions, e.g. two Sersic functions with different Sersic indices, with a P1 outer truncation on the inner component and a 1-P1 inner truncation on the outer component. When adding the truncation function to each component, the magnitude as a function of radius becomes very large, because of how small P1 gets: mu1 - 2.5*log(P1) blows up as P1 asymptotically approaches zero.
TLDR: What I would like to know is whether there is a way of preserving floating-point values whose accuracy is insufficient to distinguish them from zero (in particular in the results of functions that asymptotically approach zero). This is important because taking the logarithm of such numbers results in a domain error.
The last number before the non-logarithmic P1 starts reading zero is 5.551115123125783e-17, which is a typical floating-point rounding artifact: at that point the true value is already smaller than the achievable precision.
Any input would be greatly appreciated.
@Dan:
without putting my whole script:
xc1,yc1 = 103.5150,102.5461;
Ee1 = 23.6781;
re1 = 10.0728*0.187;
n1 = 4.0234;
# radial brightness profile (magnitudes -- really surface brightness but fine in ex.)
mu1 = [ Ee1 + 2.5/m.log(10)*bn(n1)*((x/re1)**(1.0/n1) - 1) for x in r];
# outer truncation
rb1 = 8.0121
drs1 = 11.4792
P1 = [ (0.5*m.tanh( (2.0 - B(rb1,drs1) ) * x / rb1 + B(rb1,drs1) ) + 0.5) for x in r]
P1 = [ -2.5*m.log10(x) if x!=0.0 else np.inf for x in P1 ] # band-aid for problem
mu1t = [x+y for x,y in zip(P1,mu1)] # m1 truncated by P1
where bn(n1)=7.72 and B(rb1,drs1) = 2.65 - 4.98 * ( r_b1 / (-drs1) );
mu1 is the magnitude profile of the component to be truncated, and P1 is the truncation function. Many of the final entries of P1 are zero because the floating-point results are too small to be distinguished from zero at double precision.
An easy way to see the problem:
>>> r = np.arange(0,101,1)
>>> P1 = [0.5*m.tanh(-x)+0.5 for x in r]
>>> P1
[0.5, 0.11920292202211757, 0.01798620996209155, 0.002472623156634768, 0.000335350130466483, 4.539786870244589e-05, 6.144174602207286e-06, 8.315280276560699e-07, 1.1253516207787584e-07, 1.5229979499764568e-08, 2.0611536366565986e-09, 2.789468100949932e-10, 3.775135759553905e-11, 5.109079825871277e-12, 6.914468997365475e-13, 9.35918009759007e-14, 1.2656542480726785e-14, 1.7208456881689926e-15, 2.220446049250313e-16, 5.551115123125783e-17, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
Note also the floats before zeros.
Recall that the hyperbolic tangent can be expressed as (1-e^{-2x})/(1+e^{-2x}). With a bit of algebra, your example function 0.5*tanh(-x)+0.5 works out to e^{-2x}/(1+e^{-2x}). Its logarithm is -2*x - log(1+exp(-2*x)), which can be evaluated without underflow on the side where the original expression rounds to zero.
That is, I recommend you replace:
P1 = [ (0.5*m.tanh( (2.0 - B(rb1,drs1) ) * x / rb1 + B(rb1,drs1) ) + 0.5) for x in r]
P1 = [ -2.5*m.log10(x) if x!=0.0 else np.inf for x in P1 ] # band-aid for problem
With this simpler and more stable way of doing it:
r = np.arange(0.1,100.01,0.01)
#r and xvals are numpy arrays, so numpy functions can be applied in one step
xvals=(2.0 - B(rb1,drs1) ) * r / rb1 + B(rb1,drs1)
P1=2*xvals+np.log1p(np.exp(-2*xvals))
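If you need the asker's exact form -2.5*log10(0.5*tanh(u)+0.5), the same trick can be written as a numerically stable softplus, since -ln(0.5*tanh(u)+0.5) = ln(1+e^(-2u)). A sketch with a hypothetical helper name (check the sign convention of u against your own model):
import numpy as np

def trunc_mag(u):
    # -2.5*log10(0.5*tanh(u) + 0.5), evaluated as 2.5 * softplus(-2u) / ln(10)
    z = -2.0 * np.asarray(u, dtype=float)
    softplus = np.maximum(z, 0.0) + np.log1p(np.exp(-np.abs(z)))
    return 2.5 * softplus / np.log(10.0)

print(trunc_mag(np.array([-60.0, 0.0, 60.0])))   # [~130.3, ~0.753, ~0.0]: large but finite, no domain error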
Two things you can try.
(1) Brute force approach: find a variable-precision floating-point package and use that instead of the built-in fixed precision. I am playing with your problem in Maxima [1] and I find that I have to increase the float precision quite a lot to avoid underflow, but it is possible. I can post the Maxima code if you want. I would imagine there is a suitable variable-precision float library for Python.
(2) Approximate log((1/2)(1 + tanh(-x))) with a Taylor series or some other kind of approximation, in order to avoid taking log(tanh(...)) altogether.
[1] http://maxima.sourceforge.net
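For (1), a minimal sketch of the brute-force route in Python using mpmath (an assumption on the library; any arbitrary-precision package would do, and the working precision has to exceed the number of digits lost to cancellation):
from mpmath import mp, mpf, tanh, log10

mp.dps = 100                       # plenty of digits; 0.5*tanh(-60) + 0.5 is about 7.7e-53
x = mpf(60)
P1 = mpf("0.5") * tanh(-x) + mpf("0.5")
print(-mpf("2.5") * log10(P1))     # about 130.3, instead of a log-of-zero domain error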

When normalizing list of numbers, result is all zeros

I need to normalise the values in a list to produce a (cumulative) probability distribution, but currently I'm just getting 0s out.
Here's what I'm doing:
from collections import Counter

tests = []
#some code to populate tests which simulates
count = [x[0] for x in tests]
found = [x[1] for x in tests]
found.sort()
num = Counter(found)
freqs = [x for x in num.values()]
cumsum = [sum(item for item in freqs[0:rank+1]) for rank in xrange(len(freqs))]
normcumsum = [float(x/numtests) for x in cumsum]
Currently cumsum and normcumsum are:
cumsum = [1, 2, 6, 12, 28, 39, 64, 85, 96, 98, 99, 100]
normcumsum = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
How do I get normcumsum to contain cumsum/100?
N.B. Yes, these variable names are a little stupid.
x/numtests will always return 0 here, much like 1/2 returns 0, because you're doing integer division.
You must do float(x)/numtests, or do:
from __future__ import division
This is only necessary in Python 2, not Python 3.
Demo:
>>> [1/2, 3/2, 5/2]
[0, 1, 2]
>>> from __future__ import division
>>> [1/2, 3/2, 5/2]
[0.5, 1.5, 2.5]
When both operands of a division are integers, Python 2 performs integer (floor) division and discards the fractional part, so you need to make one of them a float: for example, change float(x/numtests) to float(x)/numtests (the conversion has to happen before the division, not after).
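Applied to the question's code (Python 2), with numtests being the total number of tests (100 in the example shown), either of these works:
normcumsum = [float(x) / numtests for x in cumsum]   # convert before dividing
# or equivalently
normcumsum = [x / float(numtests) for x in cumsum]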
