I can't seem to find it or probably my knowledge on statistics and its terms are the problem here but I want to achieve something similar to the graph found on the bottom page of the LDA lib from PyPI and observe the uniformity/convergence of the lines. How can I achieve this with Gensim LDA?
You are right to wish to plot the convergence of your model fitting.
Gensim unfortunately does not seem to make this very straight forward.
Run the model in such a way that you will be able to analyze the output of the model fitting function. I like to setup a log file.
import logging
logging.basicConfig(filename='gensim.log',
format="%(asctime)s:%(levelname)s:%(message)s",
level=logging.INFO)
Set the eval_every parameter in LdaModel. The lower this value is the better resolution your plot will have. However, computing the perplexity can slow down your fit a lot!
lda_model =
LdaModel(corpus=corpus,
id2word=id2word,
num_topics=30,
eval_every=10,
pass=40,
iterations=5000)
Parse the log file and make your plot.
import re
import matplotlib.pyplot as plt
p = re.compile("(-*\d+\.\d+) per-word .* (\d+\.\d+) perplexity")
matches = [p.findall(l) for l in open('gensim.log')]
matches = [m for m in matches if len(m) > 0]
tuples = [t[0] for t in matches]
perplexity = [float(t[1]) for t in tuples]
liklihood = [float(t[0]) for t in tuples]
iter = list(range(0,len(tuples)*10,10))
plt.plot(iter,liklihood,c="black")
plt.ylabel("log liklihood")
plt.xlabel("iteration")
plt.title("Topic Model Convergence")
plt.grid()
plt.savefig("convergence_liklihood.pdf")
plt.close()
Related
I've patched the following code from examples I've found over the web:
# gensim modules
from gensim import utils
from gensim.models.doc2vec import LabeledSentence
from gensim.models import Doc2Vec
from sklearn.cluster import KMeans
# random
from random import shuffle
# classifier
class LabeledLineSentence(object):
def __init__(self, sources):
self.sources = sources
flipped = {}
# make sure that keys are unique
for key, value in sources.items():
if value not in flipped:
flipped[value] = [key]
else:
raise Exception('Non-unique prefix encountered')
def __iter__(self):
for source, prefix in self.sources.items():
with utils.smart_open(source) as fin:
for item_no, line in enumerate(fin):
yield LabeledSentence(utils.to_unicode(line).split(), [prefix + '_%s' % item_no])
def to_array(self):
self.sentences = []
for source, prefix in self.sources.items():
with utils.smart_open(source) as fin:
for item_no, line in enumerate(fin):
self.sentences.append(LabeledSentence(utils.to_unicode(line).split(), [prefix + '_%s' % item_no]))
return self.sentences
def sentences_perm(self):
shuffle(self.sentences)
return self.sentences
sources = {'test.txt' : 'DOCS'}
sentences = LabeledLineSentence(sources)
model = Doc2Vec(min_count=1, window=10, size=100, sample=1e-4, negative=5, workers=8)
model.build_vocab(sentences.to_array())
for epoch in range(10):
model.train(sentences.sentences_perm())
print(model.docvecs)
my test.txt file contains a paragraph per line.
The code runs fine and generates DocvecsArray for each line of text
my goal is to have an output like so:
cluster 1: [DOC_5,DOC_100,...DOC_N]
cluster 2: [DOC_0,DOC_1,...DOC_N]
I have found the following Answer, but the output is:
cluster 1: [word,word...word]
cluster 2: [word,word...word]
How can I alter the code and get document clusters?
So it looks like you're almost there.
You are outputting a set of vectors. For the sklearn package, you have to put those into a numpy array - using the numpy.toarray() function would probably be best. The documentation for KMeans is really stellar and even across the whole library it's good.
A note for you is that I have had much better luck with DBSCAN than KMeans, which are both contained in the same sklearn library. DBSCAN doesn't require you to specify how many clusters you want to have on the output.
There are well-commented code examples in both links.
In my case I used:
for doc in docs:
doc_vecs = model.infer_vector(doc.split())
# creating a matrix from list of vectors
mat = np.stack(doc_vecs)
# Clustering Kmeans
km_model = KMeans(n_clusters=5)
km_model.fit(mat)
# Get cluster assignment labels
labels = km_model.labels_
# Clustering DBScan
dbscan_model = DBSCAN()
labels = dbscan_model.fit_predict(mat)
Where model is the pre-trained Doc2Vec model. In my case I didn't need to cluster the same documents of the training but new documents saved in the docs list
I'm trying to model some simulated data using the DPGMM classifier from scikitlearn, but I'm getting poor performance. Here is the example I'm using:
from sklearn import mixture
import numpy as np
import matplotlib.pyplot as plt
clf = mixture.DPGMM(n_components=5, init_params='wc')
s = 0.1
a = np.random.normal(loc=1, scale=s, size=(1000,))
b = np.random.normal(loc=2, scale=s, size=(1000,))
c = np.random.normal(loc=3, scale=s, size=(1000,))
d = np.random.normal(loc=4, scale=s, size=(1000,))
e = np.random.normal(loc=7, scale=s*2, size=(5000,))
noise = np.random.random(500)*8
data = np.hstack([a,b,c,d,e,noise]).reshape((-1,1))
clf.means_ = np.array([1,2,3,4,7]).reshape((-1,1))
clf.fit(data)
labels = clf.predict(data)
plt.scatter(data.T, np.random.random(len(data)), c=labels, lw=0, alpha=0.2)
plt.show()
I would think that this would be exactly the kind of problem that gaussian mixture models would work for. I've tried playing around with alpha, using gmm instead of dpgmm, changing the number of starting components, etc. I can't seem to get a reliable and accurate classification. Is there something I'm just missing? Is there another model that would be more appropriate?
Because you didn't iterate long enough for it to converge.
Check the value of
clf.converged_
and try increasing n_iter to 1000.
Note that, however, the DPGMM still fails miserably IMHO on this data set, decreasing the number of clusters to just 2 eventually.
I am trying to build a simple neural network using Python and Pybrain package.
As I am starting to learn both the method and Pybrain package. I tried to make a very simple neuralnet with some real data that I have available!
I know there is an underlying connection to my data, however the code does not converge at all, and the results after the training are basically the same for any set of real validation data that I put there. Below is my code and a small part of the data. I have over 5000 lines of data available with known g to train my network, but it does not matter the number of points added to the training.
from pybrain.tools.shortcuts import buildNetwork as bld
from pybrain.datasets import SupervisedDataSet as spds
from pybrain.supervised.trainers import BackpropTrainer as bpt
import numpy as np
u,g,r,i,z = np.loadtxt("dataset.dat",unpack=True)
data = spds(4,1)
net = bld(4,1000,1)
for i in range(0,len(umag)):
data.addSample((u[i],r[i],i[i],z[i]),(g[i]))
trainer = bpt(net,data)
trainer.trainUntilConvergence(dataset=data,maxEpochs=300,validationProportion=0.5)
p = net.activate([17.136,15.812,15.693,15.675])
print p
#expected result 16.225
p = net.activate([19.382,17.684,17.511,17.435])
# 18.195 - expected result
print p
18.14981 15.10829 13.96468 -10.8685 13.20411
16.84580 15.17839 14.61974 14.44930 14.44493
16.70895 15.57959 15.28097 15.16538 15.19260
18.44166 16.32709 15.45345 15.14938 15.04544
18.03881 16.49129 15.96768 15.78446 15.77211
21.15679 18.66248 17.46381 16.97513 16.75475
19.25665 17.80023 17.18956 16.97563 16.94967
17.01522 16.08040 15.85172 15.81930 15.92262
19.21695 17.72263 17.17900 16.98280 16.97201
19.98507 18.56911 17.98143 17.80738 17.81714
16.94824 15.97417 15.70555 15.59221 15.64357
21.20893 19.40982 18.68114 18.46647 18.43065
18.72652 17.38880 16.93716 16.73246 16.75096
20.57421 19.55045 19.15475 18.99772 19.02503
22.48833 20.07709 18.68276 17.60561 17.09613
22.27604 20.34056 19.66521 19.37319 19.30457
20.58372 19.18035 18.64691 18.43370 18.39288
22.25103 20.74570 20.16532 19.94144 19.78580
22.49646 19.63043 18.39409 17.97594 17.77803
19.22686 17.55373 16.97127 16.76445 16.70418
20.44500 19.34502 18.96556 18.80437 18.78767
22.69331 21.19628 19.89190 19.39628 19.11377
19.51075 18.02397 17.46963 17.31436 17.27759
19.92604 18.49456 17.97421 17.83519 17.80557
19.18904 18.22256 17.84221 17.70319 17.64457
20.23186 18.43468 17.81423 17.60103 17.54677
19.86590 18.32822 17.75089 17.57386 17.53067
20.84188 19.78345 19.42506 19.27895 19.34572
22.14103 21.86670 21.74832 21.61244 21.99680
18.02018 16.69380 16.23947 16.12869 16.09864
19.92574 18.63316 18.15877 17.95703 17.90224
Generally speaking, I get better results if I have scaled my data to be between 0 and 1, or better yet between 0.1 and 0.9. The neuron output is usually going to be between 0 and 1. You might try scaling your inputs and outputs to be within this range, and see if you get better results.
This is perhaps a silly question.
I'm trying to fit data to a very strange PDF using MCMC evaluation in PyMC. For this example I just want to figure out how to fit to a normal distribution where I manually input the normal PDF. My code is:
data = [];
for count in range(1000): data.append(random.gauss(-200,15));
mean = mc.Uniform('mean', lower=min(data), upper=max(data))
std_dev = mc.Uniform('std_dev', lower=0, upper=50)
# #mc.potential
# def density(x = data, mu = mean, sigma = std_dev):
# return (1./(sigma*np.sqrt(2*np.pi))*np.exp(-((x-mu)**2/(2*sigma**2))))
mc.Normal('process', mu=mean, tau=1./std_dev**2, value=data, observed=True)
model = mc.MCMC([mean,std_dev])
model.sample(iter=5000)
print "!"
print(model.stats()['mean']['mean'])
print(model.stats()['std_dev']['mean'])
The examples I've found all use something like mc.Normal, or mc.Poisson or whatnot, but I want to fit to the commented out density function.
Any help would be appreciated.
An easy way is to use the stochastic decorator:
import pymc as mc
import numpy as np
data = np.random.normal(-200,15,size=1000)
mean = mc.Uniform('mean', lower=min(data), upper=max(data))
std_dev = mc.Uniform('std_dev', lower=0, upper=50)
#mc.stochastic(observed=True)
def custom_stochastic(value=data, mean=mean, std_dev=std_dev):
return np.sum(-np.log(std_dev) - 0.5*np.log(2) -
0.5*np.log(np.pi) -
(value-mean)**2 / (2*(std_dev**2)))
model = mc.MCMC([mean,std_dev,custom_stochastic])
model.sample(iter=5000)
print "!"
print(model.stats()['mean']['mean'])
print(model.stats()['std_dev']['mean'])
Note that my custom_stochastic function returns the log likelihood, not the likelihood, and that it is the log likelihood for the entire sample.
There are a few other ways to create custom stochastic nodes. This doc gives more details, and this gist contains an example using pymc.Stochastic to create a node with a kernel density estimator.
I have a data set that I know has a Pareto distribution. Can someone point me to how to fit this data set in Scipy? I got the below code to run but I have no idea what is being returned to me (a,b,c). Also, after obtaining a,b,c, how do I calculate the variance using them?
import scipy.stats as ss
import scipy as sp
a,b,c=ss.pareto.fit(data)
Be very careful fitting power laws!! Many reported power laws are actually badly fitted by a power law. See Clauset et al. for all the details (also on arxiv if you don't have access to the journal). They have a companion website to the article which now links to a Python implementation. Don't know if it uses Scipy because I used their R implementation when I last used it.
Here's a quickly written version, taking some hints from the Reference page that Rupert gave.
This is currently work in progress in scipy and statsmodels and requires MLE with some fixed or frozen parameters, which is only available in the trunk versions.
No standard errors on the parameter estimates or other result statistics are available yet.
'''estimating pareto with 3 parameters (shape, loc, scale) with nested
minimization, MLE inside minimizing Kolmogorov-Smirnov statistic
running some examples looks good
Author: josef-pktd
'''
import numpy as np
from scipy import stats, optimize
#the following adds my frozen fit method to the distributions
#scipy trunk also has a fit method with some parameters fixed.
import scikits.statsmodels.sandbox.stats.distributions_patch
true = (0.5, 10, 1.) # try different values
shape, loc, scale = true
rvs = stats.pareto.rvs(shape, loc=loc, scale=scale, size=1000)
rvsmin = rvs.min() #for starting value to fmin
def pareto_ks(loc, rvs):
est = stats.pareto.fit_fr(rvs, 1., frozen=[np.nan, loc, np.nan])
args = (est[0], loc, est[1])
return stats.kstest(rvs,'pareto',args)[0]
locest = optimize.fmin(pareto_ks, rvsmin*0.7, (rvs,))
est = stats.pareto.fit_fr(rvs, 1., frozen=[np.nan, locest, np.nan])
args = (est[0], locest[0], est[1])
print 'estimate'
print args
print 'kstest'
print stats.kstest(rvs,'pareto',args)
print 'estimation error', args - np.array(true)
Let's say you data is formated like this
import openturns as ot
data = [
[2.7018013],
[8.53280352],
[1.15643882],
[1.03359467],
[1.53152735],
[32.70434285],
[12.60709624],
[2.012235],
[1.06747063],
[1.41394096],
]
sample = ot.Sample([[v] for v in data])
You can easily fit a Pareto distribution using ParetoFactory of OpenTURNS library:
distribution = ot.ParetoFactory().build(sample)
You can of course print it:
print(distribution)
>>> Pareto(beta = 0.00317985, alpha=0.147365, gamma=1.0283)
or plot its PDF:
from openturns.viewer import View
pdf_graph = distribution.drawPDF()
pdf_graph.setTitle(str(distribution))
View(pdf_graph, add_legend=False)
More details on the ParetoFactory are provided in the documentation.
Before passing the data to build() function in OPENTURNS, make sure to convert it this way:
data = [[i] for i in data]
Because Sample() function may return an error.
FYI #Tropilio