How to get K max values from a histogram? - python

I want to extract, let's say, the 3 max values in a matplotlib histogram.
There are a lot of ways to extract the (unique) max value of a histogram, but I can't find anything about extracting the 2, 3 or 4 highest values.
I also want it to be automatic (not specific to the following case).
Here is my data and my code:
from matplotlib.pyplot import *
Angle=[0.0, 0.0, 0.0, 0.0, 1.5526165117219184, 0.0, 1.559560844536934, 0.0, 1.5554129250143014, 1.5529410816553442, 1.5458015331759765, -0.036680787756651845, 0.0, 0.0, 0.0, 0.0, -0.017855245139552514, -0.03224688243525392, 1.5422326689561365, 0.595918005516301, -0.06731387579270513, -0.011627382956383872, 1.5515679276951895, -0.06413211500143158, 0.0, -0.6123221322275954, 0.0, 0.0, 0.13863973713415806, 0.07677189126977804, -0.021735706841792667, 0.0, -0.6099169030770674, 1.546410917622178, 0.0, 0.0, -0.24111767845146836, 0.5961991412974801, 0.014704822377851432]
figure(1,figsize=(16,10))
hist(Angle, bins=100, label='Angle')
show()

plt.hist returns the bin heights, the bin boundaries and the rectangle patches. np.argsort can sort the heights, and the result can then be used to index the other two. The code below imports pyplot as plt, because importing it via * can lead to a lot of confusion.
import matplotlib.pyplot as plt
import numpy as np
Angle=[0.0, 0.0, 0.0, 0.0, 1.5526165117219184, 0.0, 1.559560844536934, 0.0, 1.5554129250143014, 1.5529410816553442, 1.5458015331759765, -0.036680787756651845, 0.0, 0.0, 0.0, 0.0, -0.017855245139552514, -0.03224688243525392, 1.5422326689561365, 0.595918005516301, -0.06731387579270513, -0.011627382956383872, 1.5515679276951895, -0.06413211500143158, 0.0, -0.6123221322275954, 0.0, 0.0, 0.13863973713415806, 0.07677189126977804, -0.021735706841792667, 0.0, -0.6099169030770674, 1.546410917622178, 0.0, 0.0, -0.24111767845146836, 0.5961991412974801, 0.014704822377851432]
plt.figure(1,figsize=(10, 6))
values, bins, patches = plt.hist(Angle, bins=30)
order = np.argsort(values)[::-1]
print("4 highest bins:", values[order][:4])
print(" their ranges:", [ (bins[i], bins[i+1]) for i in order[:4]])
for i in order[:4]:
    patches[i].set_color('fuchsia')
plt.show()
Output:
4 highest bins: [21. 8. 3. 2.]
their ranges: [(-0.03315333842372081, 0.03924276080176348), (1.4871647453114498, 1.559560844536934), (-0.1055494376492051, -0.03315333842372081), (0.5460154553801537, 0.6184115546056381)]
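If no plot is needed, the same ranking can be done with np.histogram alone; a minimal sketch (same idea, just no patches to recolor):
counts, edges = np.histogram(Angle, bins=30)
order = np.argsort(counts)[::-1]  # bin indices, highest count first
print([(counts[i], edges[i], edges[i + 1]) for i in order[:3]])  # (count, left edge, right edge)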
Another example, highlighting the 3 highest bins:
Angle = np.random.normal(np.tile(np.random.uniform(1, 100, 20), 100), 5)
values, bins, patches = plt.hist(Angle, bins=100)
for i in np.argsort(values)[::-1][:3]:
    patches[i].set_color('fuchsia')

Related

Pandas Dataframe plot method not interpreting color parameter properly

I have the following colors variable that contains the RGBA values of the colours I would like my bar graphs to be.
colors = [(1.0, 0.0, 0.0, 1.0),
(0.9625172106584595, 0.0, 0.0, 1.0),
(0.8426095407791366, 0.0, 0.0, 1.0),
(0.7803353041224589, 0.0, 0.0, 1.0),
(0.7778812667044626, 0.0, 0.0, 1.0),
(0.7527658540536163, 0.0, 0.0, 1.0),
(0.7322264517696606, 0.0, 0.0, 1.0),
(0.6348343727221187, 0.0, 0.0, 1.0),
(0.5364622985340568, 0.0, 0.0, 1.0),
(0.5, 0.0, 0.0, 1.0)]
However, it seems like all of the bars only take the colour of the first value:
reg_graph = reg_df.plot(kind="barh",figsize=(10,6),color=colors)
reg_graph.set_title("Number of records by region",fontsize=15)
Any ideas on how to resolve this issue?
-------- EDIT ---------
I have included more code from my jupyter notebook
reg_df = pd.DataFrame({"Records per region":df["Region"].value_counts()})
reg_df["Region Name"] = reg_df.index.to_series().map(codes_to_names_map["Region"])
#reg_df = reg_df.set_index("Region Name")
reg_df = reg_df.sort_values("Records per region")
reg_df
#Applying a colour gradient for better data visualization
max_value = reg_df["Records per region"].max()
min_value = reg_df["Records per region"].min()
def calculate_color(value):
    MAX_COLOR = (0.5, 0, 0)
    MIN_COLOR = (1, 0, 0)
    diff_in_color = tuple(map(lambda i, j: i - j, MAX_COLOR, MIN_COLOR))
    calculate_diff = tuple((value - min_value) / (max_value - min_value) * i for i in diff_in_color)
    return tuple(map(lambda i, j: i + j, calculate_diff, MIN_COLOR))
colors = [i for i in reg_df["Records per region"].apply(calculate_color)]
colors
# Plotting the bar graph
reg_graph = reg_df.plot(kind="barh",figsize=(10,6),color=colors)
reg_graph.set_title("Number of records by region",fontsize=15)
This final bit of code produces the graph entirely in red despite the colors list containing different RGBA values.
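One possible workaround (a sketch, assuming the cause is that pandas matches the color list to columns rather than to individual bars) is to call matplotlib's barh directly, which does accept one colour per bar:
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(10, 6))
ax.barh(reg_df.index.astype(str), reg_df["Records per region"], color=colors)
ax.set_title("Number of records by region", fontsize=15)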

Similarity Measure/Matrix for data (recommender system)- Python

I am new to machine learning and am trying out the following problem.
The input is two arrays of descriptions of the same length, and the output is an array of similarity scores: the first string of the first array compared to the first string of the second array, and so on.
Each item in the (numpy) arrays is a description string. Can you write a function that finds out how similar two strings are by counting how many identical and co-occurring word IDs there are, and assigns a score (one possible weight could be based on the frequency of co-occurrence versus the sum of the frequencies of the individual word IDs)? Then apply the function to the two arrays to get the array of scores.
Please also let me know if there are other approaches you would consider.
Thanks!
Data:
array(['0/1/2/3/4/5/6/5/7/8/9/3/10', '11/12/13/14/15/15/16/17/12',
'18/19/20/21/22/23/24/25',
'26/27/28/29/30/31/32/33/34/35/36/37/38/39/33/34/40/41',
'5/42/43/15/44/45/46/47/48/26/49/50/51/52/49/53/54/51/55/56/22',
'57/58/59/60/61/49/62/23/57/58/63/57/58', '64/65/66/63/67/68/69',
'70/71/72/73/74/75/76/77',
'78/79/80/81/82/83/84/85/86/87/88/89/90/91',
'33/34/92/93/94/95/85/96/97/98/99/60/85/86/100/101/102/103',
'104/105/106/107/108/109/110/86/107/111/112/113/5/114/110/113/115/116',
'117/118/119/120/121/12/122/123/124/125',
'14/126/127/128/122/129/130/131/132/29/54/29/129/49/3/133/134/135/136',
'137/138/139/140/141/142',
'143/144/145/146/147/148/149/150/151/152/4/153/154/155/156/157/158/128/159',
'160/161/162/163/131/2/164/165/166/167/168/169/49/170/109/171',
'172/173/174/175/176/177/73/178/104/179/180/179/181/173',
'182/144/183/179/73',
'184/163/68/185/163/8/186/187/188/54/189/190/191',
'181/192/0/1/193/194/22/195',
'113/196/197/198/68/199/68/200/201/202/203/201',
'204/205/206/207/208/209/68/200',
'163/210/211/122/212/213/214/215/216/217/100/101/160/139/218/76/179/219',
'220/221/222/223/5/68/224/225/54/225/226/227/5/221/222/223',
'214/228/5/6/5/215/228/228/229',
'230/231/232/233/122/215/128/214/128/234/234',
'235/236/191/237/92/93/238/239',
'13/14/44/44/240/241/242/49/54/243/244/245/55/56',
'220/21/246/38/247/201/248/73/160/249/250/203/201',
'214/49/251/252/253/254/255/256/257/258'],
dtype='|S127')
array(['151/308/309/310/311/215/312/160/313/214/49/12',
'314/315/316/275/317/42/318/319/320/212/49/170/179/29/54/29/321/322/323',
'324/325/62/220/326/194/327/328/218/76/241/329',
'330/29/22/103/331/314/68/80/49',
'78/332/85/96/97/227/333/4/334/188',
'57/335/336/34/187/337/21/338/212/213/339/340',
'341/342/167/343/8/254/154/61/344',
'2/292/345/346/42/347/348/348/100/349/202/161/263',
'283/39/312/350/26/351', '352/353/33/34/144/218/73/354/355',
'137/356/357/358/357/359/22/73/170/87/88/78/123/360/361/53/362',
'23/363/10/364/289/68/123/354/355',
'188/28/365/149/366/98/367/368/369/370/371/372/368',
'373/155/33/34/374/25/113/73', '104/375/81/82/168/169/81/82/18/19',
'179/376/377/378/179/87/88/379/20',
'380/85/381/333/382/215/128/383/384', '385/129/386/387/388',
'389/280/26/27/390/391/302/392/393/165/394/254/302/214/217/395/396',
'397/398/291/140/399/211/158/27/400', '401/402/92/93/68/80',
'77/129/183/265/403/404/405/406/60/407/162/408/409/410/411/412/413/156',
'129/295/90/259/38/39/119/414/415/416/14/318/417/418',
'419/420/421/422/423/23/424/241/421/425/58',
'426/244/427/5/428/49/76/429/430/431',
'257/432/433/167/100/101/434/435/436', '437/167/438/344/356/170',
'439/440/441/442/192/443/68/80/444/445/111', '446/312/23/447/448',
'385/129/218/449/450/451/22/452/125/129/453/212/128/454/455/456/457/377'],
dtype='|S127')
The following code should do what you need, in Python 3.x:
import numpy as np
from collections import Counter

def jaccardSim(c1, c2):
    # Jaccard similarity on multisets: intersection count / union count
    cU = c1 | c2
    cI = c1 & c2
    sim = sum(cI.values()) / sum(cU.values())
    return sim

def byteArraySim(b1, b2):
    cA = [Counter(b1[i].decode(encoding="utf-8", errors="strict").split("/"))
          for i in range(len(b1))]
    cB = [Counter(b2[i].decode(encoding="utf-8", errors="strict").split("/"))
          for i in range(len(b2))]
    # Assuming both arrays have the same length
    cSim = [jaccardSim(cA[i], cB[i]) for i in range(len(b1))]
    return cSim  # array of similarities
The Jaccard similarity score is used in this implementation. You may use other scores, such as cosine or Hamming, to your liking.
Assuming that the arrays are stored in variables a and b, the resulting function byteArraySim(a,b) outputs the following similarity scores:
[0.0,
0.0,
0.0,
0.038461538461538464,
0.0,
0.041666666666666664,
0.0,
0.0,
0.0,
0.08,
0.0,
0.05555555555555555,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.058823529411764705,
0.0,
0.0,
0.0,
0.05555555555555555,
0.0,
0.0,
0.0,
0.0,
0.0]
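A minimal sketch of one such alternative, a cosine similarity over the same Counter objects (an illustration, not part of the original answer):
import math

def cosineSim(c1, c2):
    # cosine similarity between two Counter bags of word IDs
    dot = sum(c1[w] * c2[w] for w in set(c1) & set(c2))
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0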

Using Neural Net weights derived in Matlab on other programming language

I'm having trouble replicating a neural net created in MATLAB using Python. It's a {9,8,4} network. Below are the original outputs in MATLAB and Python respectively:
0.00187283763854096 0.00280257304094145 0.00709416898379967 0.00474275971385824 0.000545071722266366
0.0520122170317888 0.0402746073491970 0.0179208146529717 0.0245726107168336 0.230693355244371
0.430695009441386 0.434492291029203 0.410151021812136 0.416871471927059 0.469873849186641
0.562954025662924 0.539410486293765 0.666336481449288 0.637779009735872 0.284564488176231
[1.0, -1.0, -0.6875955603907775, -0.9999999426232321]
[1.0, -1.0, 0.5569364789701737, -0.9994593106654553]
[1.0, -1.0, 0.5022468075847347, -0.999780120038859]
[1.0, -1.0, 0.4924691499951816, -0.9997110849203137]
[1.0, -1.0, 0.5945295051094253, -0.9991584098381949]
I obtained the input and layer weights using net2.IW{1} and net2.LW{2}, and the biases with net2.b{1} and net2.b{2}.
Without using bias, I got something that looks close:
[-0.6296705512038354, 0.9890465283687858, 0.1368924025968622, 0.5426776395855755]
[-0.05171165478856995, 0.2973298654798701, 0.02897695903082293, 0.0499820714219222]
[-0.10046933055782481, 0.40531232885083035, 0.033067381241777244, 0.06585830703439044]
[0.03167268710874907, 0.5485036035542894, 0.10579223668518502, 0.015475934153332364]
[0.006502829360007152, 0.22928662468119648, 0.03788967208701787, 0.012868192806301859]
Hence I think the problem may lie in the bias; I'm not quite sure though.
Python implementation with weights taken from MATLAB:
import math

def sigmoid(x):
    return math.tanh(x)

def NN(inputs, bias1, bias2):
    # inputweights (IW) and hiddenweights (LW) are the weight matrices listed below
    wsum = [sum(x * y for x, y in zip(inputs[0], row)) for row in inputweights]
    wsbias = [x + y for x, y in zip(wsum, bias1)]
    inputactivation = [sigmoid(k) for k in wsbias]
    wsoutput = [sum(x * y for x, y in zip(inputactivation, row)) for row in hiddenweights]
    wsbias2 = [x + y for x, y in zip(wsoutput, bias2)]
    outputactivation = [sigmoid(k) for k in wsbias2]
    return outputactivation
I would really appreciate any solution that works.
Below are the input and layer weights, as well as the corresponding biases.
IW=[[-9.1964, -2.3015, 0.2493, 3.3648, -2.6015, -0.0795, -11.2356, 4.6861,-0.8360],
[6.0201, -1.8708, 2.7844, 0.2419, -1.1808, -8.6800, 5.8519, -5.2958, 5.3233],
[0.8597, 0.8644, -0.6913, -0.0397, 0.0619, 0.4506, 1.0687, 0.4090, -0.2874],
[2.9459, 3.2596, 2.2859, 1.1933, 2.9675, -9.6017, 3.5893, 1.4808, -7.5311],
[-0.1533, -1.4806, -2.3748, 0.8059, -0.5502, -1.0447, -0.5920, -1.1667, -1.1447],
[4.7185, -9.2097, 1.1001, -0.0173, 1.4929, 0.3884, 3.7674, 6.3459, -4.2845],
[-16.4031, 8.1351, 2.0689, 2.1267, 6.2093, -8.3875, -15.8493, -0.6096, 2.9214],
[1.7329, 0.1797, 0.1500, 9.1616, -1.7226, 0.9479, 3.2542, -24.4003, -4.2790]]
LW=[[-18.5985, 12.2366, -0.8833, -1.6382, 4.6281, 8.1221, -23.7587, -0.8589],
[12.0462, -11.5464, 6.9612, -10.8562, -7.0647, 5.6653, 16.2527, -7.6119],
[12.4176, 0.9808, 0.7650, -2.9434, -0.2765, -3.0689, -3.1528, 3.0389],
[5.7570, 7.7584, -6.9550, -2.3679, -1.4884, -11.0668, 2.6764, 26.5427]]
bias1=[-1.7639, -1.2599, -0.7560, 0.2520,-0.2520,0.7560, -1.2599, -1.7639]
bias2= [0.2129,-8.1812, 0.0202,4.4512]
My inputs
[[0.0, 0.0, 0.0414444125526, 0.0, 0.0, 0.00670501464516, 0.0, 0.0, 0.0313140652051], [0.0, 0.0, 0.0, 1.0]]
[[0.0, 0.0, 0.00398243636152, 0.0, 0.0, 0.000863557858377, 0.0, 0.0, 0.00356406423776], [0.0, 0.0, 0.0, 1.0]]
[[0.0, 0.0, 0.00440892765754, 0.0, 0.0, 0.000725737283104, 0.0, 0.0, 0.00543503005753], [0.0, 0.0, 0.0, 1.0]]
[[0.0, 0.0, 0.00565322288091, 0.0, 0.0, 0.00236630383341, 0.0, 0.0, 0.00642911490856], [0.0, 0.0, 0.0, 1.0]]
[[0.0, 0.0, 0.00250332223564, 0.0, 0.0, 0.000926998841251, 0.0, 0.0, 0.00241792804103], [0.0, 0.0, 0.0, 1.0]]
Thanks for your suggestions.
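For comparison, here is a vectorised sketch of the same forward pass in numpy, assuming tanh activations in both layers (whether it reproduces the MATLAB numbers also depends on any input/output pre-processing, such as mapminmax, configured in the MATLAB net):
import numpy as np

IWm, LWm = np.array(IW), np.array(LW)
b1, b2 = np.array(bias1), np.array(bias2)

def forward(x):
    h = np.tanh(IWm @ x + b1)      # hidden layer, 8 units
    return np.tanh(LWm @ h + b2)   # output layer, 4 units

x = np.array([0.0, 0.0, 0.0414444125526, 0.0, 0.0,
              0.00670501464516, 0.0, 0.0, 0.0313140652051])
print(forward(x))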

Floating point problems in asymptotic functions approaching zero - Python

New to Python, coming from MATLAB.
I am using a hyperbolic tangent truncation of a magnitude-scale function.
I encounter my problem when applying the function 0.5 * math.tanh(r/rE - r0) + 0.5 to an array of range values r = np.arange(0.1, 100.01, 0.01). I get several 0.0 values on the side where the function approaches zero, which cause domain errors when I take the logarithm:
P1 = [ (0.5*m.tanh(x / rE + r0 ) + 0.5) for x in r] # truncation function
I use this work-around:
P1 = [ -m.log10(x) if x!=0.0 else np.inf for x in P1 ]
which is sufficient for what I am doing but is a bit of a band-aid solution.
As requested for mathematical explicitness:
In astronomy, the magnitude scale works roughly as such:
mu = -2.5*log10(flux) + mzp # apparent magnitude
where mzp is the magnitude zero point, at which one would see 1 photon per second. Greater fluxes therefore equate to smaller (or more negative) apparent magnitudes. I am making models for sources which use multiple component functions, e.g. two Sersic functions with different Sersic indices, with a P1 outer truncation on the inner component and a (1 - P1) inner truncation on the outer component. When the truncation function is added to each component, the magnitude as a function of radius becomes very large, because the -2.5*log(P1) term in mu1 - 2.5*log(P1) blows up as P1 asymptotically approaches zero.
TL;DR: What I would like to know is whether there is a way of preserving floating-point values whose magnitude is too small to be distinguished from zero (in particular in the results of functions that asymptotically approach zero). This is important because taking the logarithm of such numbers produces a domain error.
The last nonzero value in the non-logarithmic P1 before the output starts reading zero is 5.551115123125783e-17, a common floating-point rounding artifact that appears where the exact result is too small to be distinguished from zero.
Any input would be greatly appreciated.
In reply to @Dan, without posting my whole script:
without putting my whole script:
xc1,yc1 = 103.5150,102.5461;
Ee1 = 23.6781;
re1 = 10.0728*0.187;
n1 = 4.0234;
# radial brightness profile (magnitudes -- really surface brightness but fine in ex.)
mu1 = [ Ee1 + 2.5/m.log(10)*bn(n1)*((x/re1)**(1.0/n1) - 1) for x in r];
# outer truncation
rb1 = 8.0121
drs1 = 11.4792
P1 = [ (0.5*m.tanh( (2.0 - B(rb1,drs1) ) * x / rb1 + B(rb1,drs1) ) + 0.5) for x in r]
P1 = [ -2.5*m.log10(x) if x!=0.0 else np.inf for x in P1 ] # band-aid for problem
mu1t = [x+y for x,y in zip(P1,mu1)] # m1 truncated by P1
where bn(n1) = 7.72 and B(rb1, drs1) = 2.65 - 4.98 * (rb1 / (-drs1)).
mu1 is the magnitude profile of the component to be truncated and P1 is the truncation function. Many of the final entries of P1 are zero, because the underlying values cannot be distinguished from zero at floating-point accuracy.
An easy way to see the problem:
>>> r = np.arange(0,101,1)
>>> P1 = [0.5*m.tanh(-x)+0.5 for x in r]
>>> P1
[0.5, 0.11920292202211757, 0.01798620996209155, 0.002472623156634768, 0.000335350130466483, 4.539786870244589e-05, 6.144174602207286e-06, 8.315280276560699e-07, 1.1253516207787584e-07, 1.5229979499764568e-08, 2.0611536366565986e-09, 2.789468100949932e-10, 3.775135759553905e-11, 5.109079825871277e-12, 6.914468997365475e-13, 9.35918009759007e-14, 1.2656542480726785e-14, 1.7208456881689926e-15, 2.220446049250313e-16, 5.551115123125783e-17, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
Note also the floats before zeros.
Recall that the hyperbolic tangent can be expressed as (1 - e^{-2x}) / (1 + e^{-2x}). With a bit of algebra, 0.5*tanh(-x) + 0.5 (your function with the argument negated, i.e. the side that underflows) equals e^{-2x} / (1 + e^{-2x}). The logarithm of this is -2*x - log(1 + exp(-2*x)), which is well defined and numerically stable wherever the direct expression underflows.
That is, I recommend you replace:
P1 = [ (0.5*m.tanh( (2.0 - B(rb1,drs1) ) * x / rb1 + B(rb1,drs1) ) + 0.5) for x in r]
P1 = [ -2.5*m.log10(x) if x!=0.0 else np.inf for x in P1 ] # band-aid for problem
With this simpler and more stable way of doing it:
r = np.arange(0.1, 100.01, 0.01)
# r and xvals are numpy arrays, so numpy functions can be applied in one step
xvals = (2.0 - B(rb1, drs1)) * r / rb1 + B(rb1, drs1)
P1 = 2 * xvals + np.log1p(np.exp(-2 * xvals))  # -ln of the truncation; multiply by 2.5/np.log(10) to get the -2.5*log10 scale
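As a quick sanity check (a small sketch of the identity above), both formulations agree wherever the direct computation still has precision:
import math as m
import numpy as np
x = np.array([1.0, 5.0, 10.0])
direct = [-m.log(0.5 * m.tanh(-v) + 0.5) for v in x]  # underflows for large v
stable = 2 * x + np.log1p(np.exp(-2 * x))             # fine everywhere
print(direct)
print(stable)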
Two things you can try.
(1) Brute force: find a variable-precision floating-point package and use it instead of the built-in fixed precision. I am playing with your problem in Maxima [1] and I find that I have to increase the float precision quite a lot to avoid underflow, but it is possible. I can post the Maxima code if you want. I would imagine there is a suitable variable-precision float library for Python.
(2) Approximate log((1/2)*(1 + tanh(-x))) with a Taylor series or some other kind of approximation, to avoid computing log(tanh(...)) altogether.
[1] http://maxima.sourceforge.net
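Following up on option (1), a minimal sketch using mpmath (one arbitrary-precision option for Python; the choice of library and settings here is an assumption, not part of the original suggestion):
import mpmath as mp
mp.mp.dps = 50                  # work with 50 significant digits
x = mp.mpf(30)
p = 0.5 * mp.tanh(-x) + 0.5     # about 8.8e-27 here; double precision would give 0.0
print(-2.5 * mp.log10(p))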

Input....and every element must be continuous Error using sklearn MultinomialHMM

Trying to create a left-right discrete HMM in sklearn to recognize words from recognized characters. The symbol set is " " plus the 26 letters, for 27 total symbols.
import numpy as np
from sklearn import hmm
# alphabet is symbols
symbols = [' ','a','b','c','d','e','f','g','h','i','j', #0-10
'k','l','m','n','o','p','q','r','s','t', #11-20
'u','v','w','x','y','z'] #21-26
num_symbols = len(symbols)
# words up to 6 letters
n_states = 6
obsONE = np.array([ [0,0,15,14,5,0], # __one_
[15,14,5,0,0,0], # one___
[0,0,0,15,14,5], # ___one
[0,15,14,5,0,0], # _one__
[0,0,16,14,5,0], # __pne_
[15,14,3,0,0,0], # onc___
[0,0,0,15,13,5], # ___ome
[0,15,14,5,0,0], # _one__
[0,0,15,14,5,0], # __one_
[15,14,5,0,10,15], # one_jo
[1,14,0,15,14,5], # an_one
[0,15,14,5,0,16], # _one_p
[20,0,15,14,5,0], # t_one_
[15,14,5,0,10,15], # one_jo
[21,20,0,15,14,5], # ut_one
[0,15,14,5,0,20], # _one_t
[21,0,15,14,5,0], # u_one_
[15,14,5,0,10,15], # one_jo
[0,0,0,15,14,5], # an_one
[0,15,14,5,0,26], # _one_z
[5,20,0,15,14,5] ])
pi = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0]) # initial state is the left one always
A = np.array([[0.0, 1.0, 0.0, 0.0, 0.0, 0.0], # node 1 goes to node 2
[0.0, 0.5, 0.5, 0.0, 0.0, 0.0], # node 2 can self loop or goto 3
[0.0, 0.0, 0.5, 0.5, 0.0, 0.0], # node 3 can self loop or goto 4
[0.0, 0.0, 0.0, 0.5, 0.5, 0.0], # node 4 can self loop or goto 5
[0.0, 0.0, 0.0, 0.0, 0.5, 0.5], # node 5 can self loop or goto 6
[1.0, 0.0, 0.0, 0.0, 0.0, 0.0]]) # node 6 goes to node 1
model = hmm.MultinomialHMM(n_components=n_states,
                           startprob=pi,       # the start matrix, pi
                           transmat=A,         # the transition matrix, A
                           params='e',         # update e during training (aka B)
                           init_params='ste')  # initialize with s, t, e
model.n_symbols = num_symbols
model.fit(obsONE)
But I get ValueError: Input must be both positive integer array and every element must be continuous.
The code seems to want the observations encoded directly as [0,1,2,3,4,5].
How should I set this up to get the HMM model that I want?
I faced the same problem. It seems to me that the input observation sequence doesn't contain some characters from the vocabulary, so instead of assigning numbers to the characters statically, assign the numbers after finding out which characters are actually present in the observation sequence.
For example, suppose
Word = 'apzaqb'
Use
Symbols = ['a','b','p','q','z'] for numbering,
i.e. ObsOne = np.array([0,2,4,0,3,1]),
instead of
Symbols = [' ','a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z'] for numbering,
i.e. ObsOne = np.array([1,16,26,1,17,2]).
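A minimal sketch of that renumbering (the variable names are illustrative):
import numpy as np
word = 'apzaqb'
symbols = sorted(set(word))                   # ['a', 'b', 'p', 'q', 'z']
index = {c: i for i, c in enumerate(symbols)}
ObsOne = np.array([index[c] for c in word])   # array([0, 2, 4, 0, 3, 1])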
