I have around 3000 datapoints in 100D that I project to 2D with t-SNE. Each datapoint belongs to one of three classes. However, when I run the script on two separate computers I keep getting inconsistent results. Some inconsistency is expected since the initialization is random (I don't fix a seed), but one of the computers consistently produces better results (a MacBook Pro versus a desktop machine running Ubuntu).
I use the t-SNE implementation from Scikit-learn. The script and data are identical; I've manually copied the folder to make sure. The relevant code snippet looks like this:
# TSNE is imported from sklearn.manifold at the top of the script
X_vectors, X_labels = self.load_data(spec_path, sound_path, subset)
tsne = TSNE(n_components=2, perplexity=25, random_state=None)  # no fixed seed
Y = tsne.fit_transform(X_vectors)  # 2D embedding, shape (n_samples, 2)
self.plot(X_labels, Y[:, 0], Y[:, 1], Y)
The first image is one sample generated on the MacBook; I've run it several times and it always produces a similar shape within the same x/y range. The second is from Ubuntu and is clearly better; again, I've run it several times to make sure, and it continues to generate better results, always in a higher x/y range than the Mac. I'm not sure what I'm missing here; it may very well be something obvious.
t-SNE is a heuristic. Like most heuristics, it can behave quite differently in response to small changes. The core characteristic here is that only local convergence is guaranteed, so the result is not very robust. This follows from basic optimization theory and is stated in the docs:
t-SNE has a cost function that is not convex, i.e. with different initializations we can get different results.
While you explained that, in your opinion, the lack of seeding is not the culprit (this is hard to measure; benchmarking is hard), you should check the sklearn versions on both machines, as the t-SNE code is one of the more actively developed parts of sklearn, with many changes over time.
Each of these changes can produce observations like yours (when trying only one example; a larger benchmark/test set would of course be a better way to compare t-SNE implementations).
Remark: "however one of the computers keeps getting better results" is broad, since there are at least two different interpretations:
rate the result visually / perceptually
look at kl_divergence_ achieved after optimization
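One quick way to narrow this down is to print the library version on both machines and to compare the optimization quality numerically rather than visually, via the kl_divergence_ attribute of the fitted estimator. A minimal sketch, assuming X_vectors is the data loaded in the question:

import sklearn
from sklearn.manifold import TSNE

print(sklearn.__version__)  # compare this value across the two machines

tsne = TSNE(n_components=2, perplexity=25, random_state=0)  # fixed seed for a fairer comparison
Y = tsne.fit_transform(X_vectors)

# Lower KL divergence = better optimization result, regardless of the
# embedding's absolute x/y range (which carries no meaning in t-SNE anyway).
print(tsne.kl_divergence_)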
I'm working with gensim's word2vec model, but different runs on the same dataset produce different models. I tried setting the seed to a fixed number, setting PYTHONHASHSEED, and setting the number of workers to one, but none of these methods worked.
I included my code here:
import gensim

def word2vec_model(data):
    model = gensim.models.Word2Vec(data, size=300, window=20, workers=4, min_count=1)
    model.wv.save("word2vec.wordvectors")
    embed = gensim.models.KeyedVectors.load("word2vec.wordvectors", mmap='r')
    return embed
I checked the following output:
Cooking.similar_by_vector(Cooking['apple'], topn=10, restrict_vocab=None)
example output:
[('apple', 0.9999999403953552),
('charcoal', 0.2554503381252289),
('response', 0.25395694375038147),
('boring', 0.2537640631198883),
('healthy', 0.24807702004909515),
('wrong', 0.24783077836036682),
('juice', 0.24270494282245636),
('lacta', 0.2373320758342743),
('saw', 0.2359238862991333),
('insufferable', 0.23015251755714417)]
On each run, I get different similar words.
Does anyone know how to solve this? I'd appreciate any code or documentation pointers. Thank you in advance!
You don't show how you're setting PYTHONHASHSEED. It must be set before the Python interpreter starts in order to disable Python's built-in string-hash randomization, so you can't set it from within Python.
But also: in general, you shouldn't be trying to eliminate the small jitter in results from run to run. It's an inherent part of these algorithms, especially when they are run in realistic, efficient modes (such as with many training threads). Even if you succeed, you'll wind up in a very slow (single-threaded) training mode, and the stability of the results can mislead you about what to expect in realistic deployments.
There's more discussion of this same point in the Gensim FAQ:
https://github.com/RaRe-Technologies/gensim/wiki/Recipes-&-FAQ#q11-ive-trained-my-word2vec--doc2vec--etc-model-repeatedly-using-the-exact-same-text-corpus-but-the-vectors-are-different-each-time-is-there-a-bug-or-have-i-made-a-mistake-2vec-training-non-determinism
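If you do need bit-for-bit reproducibility despite those caveats, the usual recipe is a fixed seed, a single worker thread, and PYTHONHASHSEED set in the shell before Python starts. A minimal sketch, using the gensim 3.x keyword names from the question (the output filename is just a placeholder):

# Launch as, e.g.:  PYTHONHASHSEED=0 python train.py
# (setting it inside Python, after startup, has no effect on string hashing)
import gensim

def deterministic_word2vec(data):
    model = gensim.models.Word2Vec(
        data,
        size=300,       # renamed to vector_size in gensim 4.x
        window=20,
        min_count=5,
        workers=1,      # single thread: slow, but removes thread-ordering jitter
        seed=42,        # fixed RNG seed for initialization and sampling
    )
    model.wv.save("word2vec.wordvectors")
    return model.wv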
Separately, using min_count=1 is almost always a bad idea. These algorithms can't make good vectors for words with only a single usage example, and they generally give better results when rare words are discarded, as happens with the default min_count=5. In particular, if you have enough training data to justify large 300-dimensional word vectors, you should probably be increasing, rather than decreasing, the default min_count=5 cutoff.
We are using sklearn in Python and trying to run agglomerative clustering (Ward's) over a range of cluster numbers (i.e. N=2-9) using the full tree, without having to re-compute the tree for each individual value of N, by using the cache. This was answered in an old post from 2016, but that answer no longer seems to work (see sklearn agglomerative clustering: dynamically updating the number of clusters).
In other words, we want to run fit over different values of N without re-clustering every time. However, we are getting syntax errors and are not able to retrieve the labels for any of the clusters stored in the cache afterwards. The code looks something like:
x = AgglomerativeClustering(memory="mycachedir", compute_full_tree=True)
but x.fit_predict(inputDF{2}) does not fit the syntax of the memory access command
Anybody know the syntax for calling up labels from the cache in this scenario? Thanks
P.S. I'm a newbie so apologies in advance if I am not being clear.
We expect to run clustering on a given input array and retrieve the labels for each cluster as we vary the number of clusters N over a range, using the cache rather than re-computing the tree every time.
The sklearn API is badly suited for this.
It's much better to use agglomerative clustering from scipy, because there it consists of two steps: building the linkage/dendrogram, and then extracting a flat clustering from it. The first step is O(n³) with Ward, but the second step is only O(n), I think. A similar approach can be found in ELKI, too. Unfortunately, sklearn follows the narrow "fit-predict" view originating from classification, which does not support such a two-step approach.
There is also other functionality available in scipy, but not in sklearn, if I am not mistaken. Just have a look.
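A minimal sketch of that two-step approach with scipy (the array X is a placeholder for the question's input data):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(100, 5)      # placeholder for the real input array

# Step 1: build the full tree once (the expensive, O(n^3) part for Ward)
Z = linkage(X, method='ward')

# Step 2: cut the same tree at different numbers of clusters (cheap)
labels_by_n = {n: fcluster(Z, t=n, criterion='maxclust') for n in range(2, 10)}

print(labels_by_n[2])   # flat cluster labels for N=2
print(labels_by_n[9])   # flat cluster labels for N=9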
I'd be thankful for all thoughts, tips or links on this:
Using TF 1.10 and the recent Object Detection API (GitHub, 2018-08-18), I can do box and mask prediction on the PETS dataset as well as on my own proof-of-concept dataset.
But when training on the Cityscapes traffic signs (a single class), I'm having trouble achieving any results. I have adapted the anchors to account for the much smaller objects, and the RPN at least seems to be doing something useful.
However, the box predictor is not kicking in at all. That means I'm not getting any boxes, let alone masks.
My pipelines are mostly or even exactly like the sample configs.
So I'd expect either a problem with this specific type of data or a bug.
Would you have any tips/links on how to (either):
visualize the RPN results when using 2 or 3 stages? (Using only one stage does that, but how would one force that?)
train the RPN first and continue with boxes later?
investigate where/why the boxes get lost? (having predictions with zero scores while evaluation yields zero classification error)
The solution finally turned out to be a combination of multiple issues:
The parameter from_detection_checkpoint: true is deprecated and should be replaced by fine_tune_checkpoint_type: 'detection'. However, without either of these the framework seems to default to 'classification', which seems to defeat the whole idea of the object-detection framework. It is not a good idea to rely on the defaults here.
My data wasn't prepared well enough. I had boxes with zero width and/or height (for whatever reason); a small sanity-check sketch for filtering these out follows this list. I also removed masks for instances that were disconnected.
Using the keep_aspect_ratio_resizer together with random_crop_image and random_coef: 0.0 does not seem to allow for the full resolution, as the resizer seems to be applied before the random cropping. I now split my input images into (vertical) stripes [to save memory] and apply random_crop with a small min_area so it doesn't skip the small features. With the memory usage dealt with, I can now also allow max_area: 1 and a random coefficient > 0.
One potential problem also arose from the fact that I only considered a single class (so far). This might be a problem either for the framework or for the activation function in the network. However, in combination with the other fixes, this change seemed to cause no additional problems, at least.
Last but not least I updated the sources to 2018-10-02 but didn't walk through all modifications in detail.
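As a rough illustration of the data-cleaning point above, this is the kind of sanity check I mean for the boxes; the function name and box format (xmin, ymin, xmax, ymax) are just assumptions for the sketch, not part of the Object Detection API:

def drop_degenerate_boxes(boxes):
    """Keep only boxes with strictly positive width and height.

    `boxes` is assumed to be an iterable of (xmin, ymin, xmax, ymax) tuples;
    the zero-width/zero-height entries are the ones that broke training.
    """
    return [b for b in boxes if b[2] > b[0] and b[3] > b[1]]


# Hypothetical usage while preparing the annotations:
clean = drop_degenerate_boxes([(10, 10, 30, 40), (5, 5, 5, 20)])
print(clean)   # -> [(10, 10, 30, 40)]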
I hope my findings can save others some time and trouble.
I'd like to run a chi-squared test in Python. I've created code to do this, but I don't know if what I'm doing is right, because the scipy docs are quite sparse.
Background first: I have two groups of users. My null hypothesis is that there is no significant difference in whether people in either group are more likely to use desktop, mobile, or tablet.
These are the observed frequencies in the two groups:
[[u'desktop', 14452], [u'mobile', 4073], [u'tablet', 4287]]
[[u'desktop', 30864], [u'mobile', 11439], [u'tablet', 9887]]
Here is my code using scipy.stats.chi2_contingency:
import numpy as np
from scipy import stats

obs = np.array([[14452, 4073, 4287], [30864, 11439, 9887]])
chi2, p, dof, expected = stats.chi2_contingency(obs)
print(p)
This gives me a p-value of 2.02258737401e-38, which clearly is significant.
My question is: does this code look valid? In particular, I'm not sure whether I should be using scipy.stats.chi2_contingency or scipy.stats.chisquare, given the data I have.
I can't comment too much on the use of the function. However, the issue at hand may be statistical in nature. The very small p-value you are seeing is most likely a result of your data containing large frequencies (on the order of tens of thousands). When sample sizes are that large, even tiny differences become statistically significant, hence the small p-value. The test you are using is very sensitive to sample size. See here for more details.
You are using chi2_contingency correctly. If you feel uncertain about the appropriate use of a chi-squared test or how to interpret its result (i.e. your question is about statistical testing rather than coding), consider asking it over at the "CrossValidated" site: https://stats.stackexchange.com/
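To make the chi2_contingency vs chisquare distinction concrete, here is a short sketch: chi2_contingency works directly on the contingency table and derives the expected counts from the margins, whereas chisquare is a one-sample goodness-of-fit test that needs explicit expected frequencies. The goodness-of-fit comparison below is only an illustration of the API difference, not a recommendation:

import numpy as np
from scipy import stats

obs = np.array([[14452, 4073, 4287],
                [30864, 11439, 9887]])

# Test of independence on the whole 2x3 table; expected counts are
# computed for you from the row/column totals.
chi2, p, dof, expected = stats.chi2_contingency(obs)
print(p, expected)

# chisquare() is a goodness-of-fit test: you must supply the expected
# frequencies yourself, e.g. testing whether group 1's device split
# matches the proportions observed in group 2.
group1 = obs[0]
props2 = obs[1] / obs[1].sum()
chi2_gof, p_gof = stats.chisquare(group1, f_exp=props2 * group1.sum())
print(p_gof)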
(nb. just posted this on the google group, but it says it is now deprecated)
I have some code which fits about 12 model parameters to a series of datasets. The results from the pymc code appear fine and are consistent with an identical version of the code that uses the lmfit package, i.e. non-linear least squares. One concern I do have is that the 95% credible intervals are, to my mind, tiny, which suggests to me there is an error somewhere. The standard errors from the other fitting script are reasonable in size, and the function is complex enough that such unique minima seem unlikely. Could this be a consequence of how I am sampling the data? I am running 100,000 iterations, discarding 50,000 as burn-in, and thinning by a factor of 10.
My code is:
https://github.com/mdekauwe/FitFarquharModel/blob/master/fit_farquhar_model/fit_dummy_pymc.py
I can try and upload a sample driving file if that helps, but perhaps I have done something obviously stupid?
When I say tiny here is an example:
[lmfit] Vcmax25_1 = 16.55232485 +/- 1.22831709 (Std.err)
[pymc] Vcmax25_1 = 19.5718912 [19.57150052, 19.57232205] (95% HPD)
Many thanks,
Martin
P.S. I have added an example file should anyone want to test it. The bottom of that script has the necessary links... (of course, one would need to download the files from the examples directory)
My guess is the sampler must be getting stuck, so I will try to look in more detail at the traces.
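For looking at the traces, a minimal sketch using the PyMC 2 API (M stands in for the MCMC object built in fit_dummy_pymc.py; the parameter name matches the one quoted above):

import numpy as np
import pymc

# M = pymc.MCMC(...) as constructed in fit_dummy_pymc.py
M.sample(iter=100000, burn=50000, thin=10)

trace = M.trace('Vcmax25_1')[:]      # posterior samples after burn-in/thinning
print(trace.mean(), trace.std())     # a near-zero std suggests a stuck chain
print(np.unique(trace).size)         # very few unique values -> poor mixing

pymc.Matplot.plot(M)                 # trace / autocorrelation / histogram plots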