I have a problem with the Python hmmlearn library: I have several training sets and I would like to fit a single Gaussian mixture HMM to all of them.
Here is an example that works with multiple sequences:
import numpy as np
from hmmlearn import hmm

X = np.concatenate([X1, X2])
lengths = [len(X1), len(X2)]
hmm.GaussianHMM(n_components=3).fit(X, lengths)
When I change GaussianHMM to GMMHMM, it returns the following error:
hmm.GMMHMM(n_components=3).fit(X, lengths)
Traceback (most recent call last):
File "C:\Users\Cody\workspace\QuickSilver_HMT\hmm_list_sqlite.py", line 141, in hmm_list_pickle
hmm.GMMHMM(n_components=3).fit(X, lengths)
File "build\bdist.win32\egg\hmmlearn\hmm.py", line 998, in fit
raise ValueError("'lengths' argument is not supported yet")
ValueError: 'lengths' argument is not supported yet
How can one fit multiple sequences with GMMHMM?
The current master version contains a rewrite of GMMHMM, which at some point did not support multiple sequences. It does now, so updating should help, as @ppasler suggested.
The re-write is still a work-in-progress. Please report any issues you encounter on the hmmlearn issue tracker.
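For reference, here is a minimal sketch of the multiple-sequence fit against an up-to-date hmmlearn (after pip install --upgrade hmmlearn); the X1/X2 arrays are illustrative stand-ins for your training sets:

import numpy as np
from hmmlearn import hmm

# illustrative training sequences; replace with your real data
X1 = np.random.randn(100, 2)
X2 = np.random.randn(150, 2)

X = np.concatenate([X1, X2])
lengths = [len(X1), len(X2)]

# n_mix sets the number of Gaussian mixture components per hidden state
model = hmm.GMMHMM(n_components=3, n_mix=2)
model.fit(X, lengths)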
When I tried to implement k-nearest neighbors on my training datasets, I created them the same way as in this photo.
Python version: 3.7.6
OpenCV version: 4.2.0
I also followed the same code, but instead of training only handwritten digits I did it for characters and digits grouped by font type. I followed all the steps carefully and all the generated arrays are correct; only knn.train has a problem. I found some older posts saying it has problems with old versions of Python, but at the same time I heard that cv2.ml.KNearest_create() still works. Have I done something wrong?
# KNN
knn = cv2.ml.KNearest_create()
knn.train(cells, cv2.ml.ROW_SAMPLE, cells_labels)
ret, result, neighbours, dist = knn.findNearest(test_cells, k=3)
It caused a strange error. Is this incompatible with Python 3.7.6?
Traceback (most recent call last):
File "knn-apply.py", line 38, in <module>
knn.train(cells, cv2.ml.ROW_SAMPLE, cells_labels)
TypeError: Expected Ptr<cv::UMat> for argument 'responses'
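For what it's worth, in my experience this TypeError usually means the samples and labels are not the float32 NumPy arrays that OpenCV's ml module expects. A minimal sketch of the conversion, assuming cells and cells_labels arrive as Python lists or integer arrays:

import numpy as np
import cv2

# cv2.ml expects contiguous float32 arrays; lists, float64 data, or
# integer labels can trigger the "Expected Ptr<cv::UMat>" TypeError
cells = np.asarray(cells, dtype=np.float32).reshape(len(cells), -1)
cells_labels = np.asarray(cells_labels, dtype=np.float32).reshape(-1, 1)

knn = cv2.ml.KNearest_create()
knn.train(cells, cv2.ml.ROW_SAMPLE, cells_labels)
ret, result, neighbours, dist = knn.findNearest(
    np.asarray(test_cells, dtype=np.float32).reshape(len(test_cells), -1), k=3)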
I am trying to load a pre-trained Doc2Vec model using gensim and use it to map a paragraph to a vector. I am referring to https://github.com/jhlau/doc2vec, and the pre-trained model I downloaded is the English Wikipedia DBOW, available at the same link. However, when I load the Wikipedia Doc2Vec model and infer vectors using the following code:
import gensim.models as g
import codecs
model="wiki_sg/word2vec.bin"
test_docs="test_docs.txt"
output_file="test_vectors.txt"
#inference hyper-parameters
start_alpha=0.01
infer_epoch=1000
#load model and test documents
test_docs = [x.strip().split() for x in codecs.open(test_docs, "r", "utf-8").readlines()]
m = g.Doc2Vec.load(model)
#infer test vectors
output = open(output_file, "w")
for d in test_docs:
    output.write(" ".join([str(x) for x in m.infer_vector(d, alpha=start_alpha, steps=infer_epoch)]) + "\n")
output.flush()
output.close()
I get an error:
/Users/zhangji/Desktop/CSE547/Project/NLP/venv/lib/python2.7/site-packages/smart_open/smart_open_lib.py:402: UserWarning: This function is deprecated, use smart_open.open instead. See the migration notes for details: https://github.com/RaRe-Technologies/smart_open/blob/master/README.rst#migrating-to-the-new-open-function
'See the migration notes for details: %s' % _MIGRATION_NOTES_URL
Traceback (most recent call last):
File "/Users/zhangji/Desktop/CSE547/Project/NLP/AbstractMapping.py", line 19, in <module>
output.write(" ".join([str(x) for x in m.infer_vector(d, alpha=start_alpha, steps=infer_epoch)]) + "\n")
AttributeError: 'Word2Vec' object has no attribute 'infer_vector'
I know there are a couple of threads regarding the infer_vector issue on Stack Overflow, but none of them resolved my problem. I downloaded the gensim package using:
pip install git+https://github.com/jhlau/gensim
In addition, after looking at the source code of the gensim package, I found that when I use Doc2vec.load(), the Doc2vec class doesn't really have a load() function of its own; since it is a subclass of Word2vec, it calls the super load() method in Word2vec, which makes the model m a Word2vec object. However, infer_vector() is unique to Doc2vec and does not exist in Word2vec, which is why the error occurs. I also tried casting the model m to Doc2vec, but I got this error:
>>> g.Doc2Vec(m)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/zhangji/Library/Python/2.7/lib/python/site-packages/gensim/models/doc2vec.py", line 599, in __init__
self.build_vocab(documents, trim_rule=trim_rule)
File "/Users/zhangji/Library/Python/2.7/lib/python/site-packages/gensim/models/word2vec.py", line 513, in build_vocab
self.scan_vocab(sentences, trim_rule=trim_rule) # initial survey
File "/Users/zhangji/Library/Python/2.7/lib/python/site-packages/gensim/models/doc2vec.py", line 635, in scan_vocab
for document_no, document in enumerate(documents):
File "/Users/zhangji/Library/Python/2.7/lib/python/site-packages/gensim/models/word2vec.py", line 1367, in __getitem__
return vstack([self.syn0[self.vocab[word].index] for word in words])
TypeError: 'int' object is not iterable
In fact, all I want from gensim for now is to convert a paragraph to a vector using a pre-trained model that works well on academic articles. For various reasons I don't want to train a model on my own. I would be really grateful if someone could help me resolve this issue.
By the way, I am using Python 2.7, and my installed gensim version is 0.12.4.
Thanks!
I would avoid using either the 4-year-old nonstandard gensim fork at https://github.com/jhlau/doc2vec, or any 4-year-old saved models that only load with such code.
The Wikipedia DBOW model there is also suspiciously small at 1.4GB. Wikipedia had well over 4 million articles even 4 years ago, and a 300-dimensional Doc2Vec model trained to have doc-vectors for the 4 million articles would be at least 4000000 articles * 300 dimensions * 4 bytes/dimension = 4.8GB in size, not even counting other parts of the model. (So, that download is clearly not the 4.3M doc, 300-dimensional model mentioned in the associated paper – but something that's been truncated in other unclear ways.)
The current gensim version is 3.8.3, released a few weeks ago.
It'd likely take a bit of tinkering, and an overnight or more runtime, to build your own Doc2Vec model using current code and a current Wikipedia dump - but then you'd be on modern supported code, with a modern model that better understands words coming into use in the last 4 years. (And, if you trained a model on a corpus of the exact kind of documents of interest to you – such as academic articles – the vocabulary, word-senses, and match to your own text-preprocessing to be used on later inferred documents will all be better.)
There's a Jupyter notebook example of building a Doc2Vec model from Wikipedia, either functional or very close to functional, inside the gensim source tree at:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-wikipedia.ipynb
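Once you have a model trained and saved with a current gensim, inference is straightforward. A minimal sketch (the model path is illustrative; in older gensim 3.x releases the epochs parameter of infer_vector was called steps):

import gensim.models as g

# load a Doc2Vec model saved by a modern gensim; the path is illustrative
m = g.Doc2Vec.load("wiki_dbow/doc2vec.model")

tokens = "your pre-tokenized paragraph goes here".split()
vec = m.infer_vector(tokens, alpha=0.01, epochs=1000)  # mirrors the question's hyper-parameters
print(len(vec))  # equals the model's vector_size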
I am trying to use this image-recognition algorithm with the Cityscapes model:
https://github.com/fyu/dilation
However, I keep on getting the following error:
bash-4.2$ python predict.py cityscapes sunny_1336601.png --gpu 0
Using GPU 0
WARNING: Logging before InitGoogleLogging() is written to STDERR
Traceback (most recent call last):
File "predict.py", line 133, in <module>
main()
File "predict.py", line 129, in main
predict(args.dataset, args.input_path, args.output_path)
File "predict.py", line 98, in predict
color_image = dataset.palette[prediction.ravel()].reshape(image_size)
ValueError: cannot reshape array of size 12582912 into shape (1090,1920,3)
I tried reshaping the image to every common resolution I could think of, including 640x480, but I keep getting the same error.
Any help or tips is highly appreciated.
Thanks!
I don't have enough reputation to comment, so I am posting my hunch as an answer (forgive me if I'm wrong): the given size 12582912 has to be the product of the three numbers in the tuple. A quick factorization shows 12582912 = 1024 * 768 * 16 = 2048 * 1536 * 4. So, if the image is a 4-channel image, the resolution is 2048 x 1536, which is the standard 4:3 aspect ratio.
It turns out that the Cityscapes model only accepts a specific size: the width should be twice the height.
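For illustration, a minimal sketch that forces an input image into that 2:1 shape before running predict.py; cv2 and the 2048 x 1024 target (the native Cityscapes frame size) are assumptions, not part of the dilation repo:

import cv2

img = cv2.imread("sunny_1336601.png")
h, w = img.shape[:2]
if w != 2 * h:
    # resize to the expected 2:1 width:height ratio
    img = cv2.resize(img, (2048, 1024))
cv2.imwrite("sunny_1336601_resized.png", img)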
If you know Python well, you will see that this ValueError is an internal code error; it has nothing to do with missing dependencies or the environment.
It comes from the fact that the image had one total size to begin with, was flattened into an array, and is then reshaped into different dimensions.
That is not something that can or should be fixed by tampering with the input data, but by addressing the bug in the provided library itself.
It is very common to have this kind of restriction with a neural-network classifier: once the layers are trained, they can't be changed, and the input must have a very specific shape. Of course, the input can still be "cooked" before being given to the network, but this is usually nondestructive/basic scaling, so the proportions must be preserved, which is what the library gets wrong.
In Theano, the following code snippet throws a MemoryError:
self.w = theano.shared(
    np.asarray(
        np.random.normal(
            loc=0.0, scale=np.sqrt(1.0/n_out), size=(n_in, n_out)),
        dtype=theano.config.floatX),
    name='w', borrow=True)
Just to mention the sizes: n_in = 64*56*56 and n_out = 4096. The snippet is taken from the __init__ method of a fully connected layer. See the traceback:
Traceback (most recent call last):
File "<stdin>", line 8, in <module>
File "final.py", line 510, in __init__
loc=0.0, scale=np.sqrt(1.0/n_out), size=(n_in, n_out)),
File "mtrand.pyx", line 1636, in mtrand.RandomState.normal (numpy/random/mtrand/mtrand.c:20676)
File "mtrand.pyx", line 242, in mtrand.cont2_array_sc (numpy/random/mtrand/mtrand.c:7401)
MemoryError
Is there any way we can get around the problem?
A MemoryError is Python's way of saying: "I tried to get enough memory for that operation but your OS says it doesn't have enough."
So there's no workaround. You have to do it another way (or buy more RAM!). I don't know what your floatX is, but your array contains 64*56*56*4096 elements, which translates to:
6.125 GB if you use float64
3.063 GB if you use float32
1.531 GB if you use float16 (not sure if float16 is supported for your operations though)
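A quick sanity check of those numbers, mirroring the sizes from the question:

n_in = 64 * 56 * 56        # 200704
n_out = 4096
elements = n_in * n_out    # 822,083,584 elements
for name, nbytes in [("float64", 8), ("float32", 4), ("float16", 2)]:
    print(name, round(elements * nbytes / 1024**3, 3), "GiB")
# prints: float64 6.125 GiB, float32 3.063 GiB, float16 1.531 GiB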
But the problem with MemoryErrors is that just avoiding them once generally isn't enough. If you don't change your approach you'll get problems again as soon as you do an operation that requires an intermediate or new array (then you have two huge arrays) or that coerces to a higher dtype (then you have two huge arrays and the new one is of higher dtype so requires more space).
So the only viable workaround is to change the approach; maybe you can start by calculating subsets (a map-reduce approach)?
I am working on a two-class machine learning problem. The training set contains 2 million rows of URLs (strings) with labels 0 and 1. The classifier, LogisticRegression(), should predict one of the two labels when the test datasets are passed in. I get 95% accuracy when I use a smaller dataset, i.e. 78,000 URLs with 0 and 1 as labels.
The problem I am having is that when I feed in the big dataset (2 million rows of URL strings), I get this error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 540, in runfile
execfile(filename, namespace)
File "C:/Users/Slim/.xy/startups/start/chi2-94.85 - Copy.py", line 48, in <module>
bi_counts = bi.fit_transform(url_list)
File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 780, in fit_transform
vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary)
File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 717, in _count_vocab
j_indices.append(vocabulary[feature])
MemoryError
My code, which works for small datasets with fair enough accuracy, is:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression

bi = CountVectorizer(ngram_range=(3, 3), binary=True, max_features=9000, analyzer='char_wb')
bi_counts = bi.fit_transform(url_list)
tf = TfidfTransformer(norm='l2', use_idf=True)
X_train_tf = tf.fit_transform(bi_counts)
clf = LogisticRegression(penalty='l1', intercept_scaling=0.5, random_state=True)
clf.fit(X_train_tf, y)
I tried to keep max_features as small as possible, say max_features=100, but I still got the same result.
Please note:
I am using a Core i5 with 4 GB RAM; I tried the same code on 8 GB RAM, but no luck.
I am using Python 2.7.6 with sklearn, NumPy 1.8.1, SciPy 0.14.0, and Matplotlib 1.3.1.
UPDATE:
@Andreas Mueller suggested using HashingVectorizer(). I used it with both the small and the large dataset: the 78,000-row dataset ran successfully, but the 2-million-row dataset gave me the same memory error as shown above. I tried it on 8 GB RAM, and in-use memory was around 30% while processing the big dataset.
IIRC, max_features is only applied after the whole vocabulary has been computed.
The easiest way out is to use the HashingVectorizer that does not compute a dictionary.
You will lose the ability to get the corresponding token for a feature, but you shouldn't run into memory issues any more.
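A minimal sketch of that swap, reusing the question's parameters where they carry over; n_features and the liblinear solver are assumptions to adjust:

from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression

# HashingVectorizer never materializes a vocabulary, so memory stays flat
# no matter how many URLs you feed it; n_features trades collisions for RAM
hv = HashingVectorizer(ngram_range=(3, 3), binary=True,
                       analyzer='char_wb', n_features=2**18)
bi_counts = hv.transform(url_list)   # transform only: there is nothing to fit

tf = TfidfTransformer(norm='l2', use_idf=True)
X_train_tf = tf.fit_transform(bi_counts)

clf = LogisticRegression(penalty='l1', solver='liblinear')
clf.fit(X_train_tf, y)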