I have a dataset composed of millions of examples, where each example contains 128 continuous-valued features labelled with a class name. I'm trying to find a large, robust database/index to use as a KNN classifier for high-dimensional data. I tried Weka's IBk classifier, but it chokes on this much data, and even then it has to be loaded into memory. Would Lucene, specifically through the PyLucene interface, be a possible alternative?
I've found Lire, which seems to use Lucene in a similar way, but after reviewing the code, I'm not sure how they're pulling it off, or if it's the same thing I'm trying to do.
I realize Lucene is designed as a text indexing tool, and not as a general purpose classifier, but is it possible to use it in this way?
Lucene doesn't seem like the right choice given what you've told us. Lucene would give you a way to store the data, but in terms of retrieval, it's not designed to do anything but search over textual strings.
Since K-NN is so simple, you might be better off creating your own data store in a typical RDBMS or something like Berkeley DB. You could create keys/indices based on sub-hypercubes of the various dimensions to speed things up - start at the bucket of the item to be classified and move outward...
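A minimal in-memory sketch of that bucketing idea (the bucket key below uses only the first few dimensions and a fixed bin width, both arbitrary choices you would tune, and the dict stands in for whatever RDBMS or Berkeley DB key scheme you end up with):

```python
import itertools
from collections import defaultdict

import numpy as np

BUCKET_DIMS = 4     # key on the first few dimensions only (illustrative)
BIN_WIDTH = 0.25    # coarseness of each sub-hypercube (illustrative)

def bucket_key(vec):
    return tuple(int(np.floor(v / BIN_WIDTH)) for v in vec[:BUCKET_DIMS])

buckets = defaultdict(list)   # bucket key -> list of (vector, label)

def insert(vec, label):
    buckets[bucket_key(vec)].append((np.asarray(vec, dtype=float), label))

def knn(query, k, max_rings=3):
    """Start at the query's bucket and move outward ring by ring."""
    query = np.asarray(query, dtype=float)
    centre = bucket_key(query)
    candidates = []
    for ring in range(max_rings + 1):
        for off in itertools.product(range(-ring, ring + 1), repeat=BUCKET_DIMS):
            if max(abs(o) for o in off) != ring:   # visit only the outer shell
                continue
            candidates.extend(buckets.get(tuple(c + o for c, o in zip(centre, off)), []))
        if len(candidates) >= k:
            break
    # re-rank the candidates by exact distance over all 128 dimensions
    candidates.sort(key=lambda item: np.linalg.norm(item[0] - query))
    return candidates[:k]
```

A real deployment would replace the dict with keys/indices in the database, but the search pattern (query the centre bucket, then widen) stays the same.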
This is done in Lucene already with geospatial searches. Of course, the built-in geospatial searches only use two dimensions, so you'll have to modify it a bit. But the basic idea of using numeric range queries will work.
(Note: I'm not aware of anyone doing high-dimensional kNN with Lucene. So I can't comment on how fast it will be.)
I'm trying to analyze text, but my Mac's RAM is only 8 GB, and the RidgeRegressor just stops after a while with Killed: 9. I reckon this is because it would need more memory.
Is there a way to disable the stack size limiter so that the algorithm could use some kind of swap memory?
You will need to do it manually.
There are probably two different core problems here:
A: holding your training data
B: training the regressor
For A, you can try numpy's memmap, which abstracts swapping away.
As an alternative, consider preparing your data as HDF5 or putting it in some DB. For HDF5, you can use h5py or pytables, both of which allow numpy-like usage.
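A rough sketch of both options for A; the file names, dtype and shapes are placeholders for your own data, not recommendations:

```python
import numpy as np
import h5py

n_samples, n_features = 1_000_000, 500   # placeholder sizes

# Option 1: a raw memory-mapped array; numpy pages data in and out for you.
X = np.memmap("features.dat", dtype="float32", mode="w+",
              shape=(n_samples, n_features))

# Option 2: an HDF5 dataset via h5py, which also supports numpy-like slicing.
with h5py.File("features.h5", "w") as f:
    dset = f.create_dataset("X", shape=(n_samples, n_features),
                            dtype="float32", chunks=True)
    dset[:1000] = np.random.rand(1000, n_features).astype("float32")
```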
For B: it's a good idea to use an out-of-core-ready algorithm. In scikit-learn, those are the ones supporting partial_fit.
Keep in mind that this training process introduces at least two new requirements:
Memory efficiency: swapping is slow, so you don't want something that holds N^2 auxiliary memory during learning.
Efficient convergence
The algorithms in the link above should be okay for both.
SGDRegressor can be parameterized (squared loss with an L2 penalty) to resemble Ridge regression.
Also, you may need to call partial_fit manually, obeying the rules of the algorithm (convergence proofs often require some kind of random ordering of the samples). The problem with abstracting swapping away is that if your regressor performs a full permutation of the data in each epoch, without knowing how costly that is, you might be in trouble!
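A hedged sketch of that loop, assuming X is the memory-mapped feature matrix from above and y is a corresponding target array; the chunk size, alpha and number of epochs are placeholders, and the shuffling is done over chunk order to keep the I/O cost predictable:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# squared loss with an L2 penalty roughly corresponds to ridge regression
reg = SGDRegressor(penalty="l2", alpha=1e-4)

chunk_size = 10_000
n_samples = X.shape[0]
rng = np.random.default_rng(0)

for epoch in range(5):
    # permute the order of chunks (cheap) rather than all rows (expensive I/O)
    for start in rng.permutation(np.arange(0, n_samples, chunk_size)):
        stop = min(start + chunk_size, n_samples)
        reg.partial_fit(X[start:stop], y[start:stop])
```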
Because the problem itself is quite hard, there are special libraries built for this, while sklearn needs some more manual work, as explained. One of the most extreme ones (with a lot of clever tricks) is vowpal_wabbit (where IO is often the bottleneck!). Of course, there are other popular libraries like pyspark, which serve a slightly different purpose (distributed computing).
Introduction
I'd like to know what other topic modellers consider to be an optimal topic-modelling workflow, all the way from pre-processing to maintenance. While this question consists of a number of sub-questions (which I will specify below), I believe this thread would be useful for myself and others who are interested in learning about best practices for the end-to-end process.
Proposed Solution Specifications
I'd like the proposed solution to preferably rely on R for text processing (but Python is fine also), and the topic modelling itself to be done in MALLET (although if you believe other solutions work better, please let us know). I tend to use the topicmodels package in R; however, I would like to switch to MALLET, as it offers many benefits over topicmodels: it can handle a lot of data, it does not rely on specific text pre-processing tools, and it appears to be widely used for this purpose. However, some of the issues outlined below are also relevant for topicmodels. I'd like to know how others approach topic modelling and which of the steps below could be improved. Any useful piece of advice is welcome.
Outline
Here is how this will work: I'm going to go through the workflow which, in my opinion, works reasonably well, and outline the problems at each step.
Proposed Workflow
1. Clean text
This involves removing punctuation marks, digits and stop words, stemming words, and other text-processing tasks. Many of these can be done as part of term-document matrix construction, through functions such as TermDocumentMatrix from R's tm package.
Problem: This, however, may need to be performed on the text strings directly, using functions such as gsub, in order for MALLET to consume the strings. Performing it on the strings directly is not as efficient, as it involves repetition (e.g. the same word would have to be stemmed several times).
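Since the question says Python is fine as well, here is a minimal sketch of string-level cleaning before handing the text to MALLET; the stemmer choice and stop-word list are illustrative, and the cache is one way to soften the repeated-stemming cost mentioned above:

```python
import re
from functools import lru_cache

from nltk.stem.porter import PorterStemmer

STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in"}   # placeholder list
stemmer = PorterStemmer()

@lru_cache(maxsize=None)
def stem(word):
    # cache stems so the same word is not re-stemmed over and over
    return stemmer.stem(word)

def clean(text):
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)        # drop punctuation and digits
    tokens = [stem(t) for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)
```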
2. Construct features
In this step we construct a term-document matrix (TDM), followed by filtering of terms based on frequency and TF-IDF values. It is preferable to limit your bag of features to about 1000 or so. Next, go through the terms and identify which need to be (1) dropped (some stop words will make it through), (2) renamed or (3) merged with existing entries. While I'm familiar with the concept of stem completion, I find that it rarely works well.
Problem: (1) Unfortunately, MALLET does not work with TDM constructs, so to make use of your TDM you would need to find the difference between the original TDM (with no features removed) and the TDM that you are happy with; this difference would become the stop-word list for MALLET (see the sketch below). (2) On that note, I'd also like to point out that feature selection requires a substantial amount of manual work, and if anyone has ideas on how to minimise it, please share your thoughts.
Side note: If you decide to stick with R alone, then I can recommend the quanteda package, which has a function dfm that accepts a thesaurus as one of its parameters. This thesaurus allows you to capture patterns (usually regex) as opposed to the words themselves, so for example you could have a pattern \\bsign\\w*.?ups? that would match sign-up, signed up and so on.
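To illustrate problem (1), here is a sketch (again in Python, with scikit-learn standing in for whatever TDM tool you use) of turning the difference between the full vocabulary and the filtered vocabulary into an extra stop-word file for MALLET; the thresholds, toy corpus and file name are placeholders:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the quick brown fox", "the lazy dog", "quick quick brown dogs"]  # toy corpus

# vocabulary of the "original TDM" with nothing removed
full_vocab = set(CountVectorizer().fit(docs).vocabulary_)

# vocabulary of the TDM you are happy with, e.g. after frequency filtering
filtered_vocab = set(CountVectorizer(min_df=2).fit(docs).vocabulary_)

# the difference becomes the extra stop-word list for MALLET
extra_stopwords = sorted(full_vocab - filtered_vocab)

with open("mallet_extra_stopwords.txt", "w") as f:
    f.write("\n".join(extra_stopwords))
# then point MALLET at this file when importing, e.g. via its extra-stopwords option
```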
3. Find optimal parameters
This is a hard one. I tend to break the data into train and test sets and run cross-validation, fitting a model with k topics and testing the fit on the held-out data. The log likelihood is recorded and compared across different numbers of topics.
Problem: Log likelihood does help to understand how good the fit is, but (1) it often tends to suggest that I need more topics than is practically sensible, and (2) given how long it generally takes to fit a model, it is virtually impossible to search a grid of the other parameters such as iterations, alpha, burn-in and so on.
Side note: When selecting the optimal number of topics, I generally step through a range of topic counts in increments of 5 or so, as incrementing by 1 generally takes too long to compute.
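A hedged sketch of that search, using scikit-learn's LatentDirichletAllocation in place of MALLET purely to keep the example self-contained; the corpus, the vocabulary cap of 1000 and the step of 5 mirror the workflow above, and score() is the model's approximate held-out log likelihood:

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

texts = fetch_20newsgroups(remove=("headers", "footers", "quotes")).data[:2000]
X = CountVectorizer(max_features=1000, stop_words="english").fit_transform(texts)
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

for k in range(10, 51, 5):
    lda = LatentDirichletAllocation(n_components=k, random_state=0)
    lda.fit(X_train)
    print(k, lda.score(X_test))   # higher (less negative) suggests a better fit
```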
4. Maintenance
It is easy to classify new data into a set of existing topics. However, if you are running the model over time, you would naturally expect that some of your topics may cease to be relevant, while new topics may appear. Furthermore, it might be of interest to study the lifecycle of topics. This is difficult to account for, as you are dealing with a problem that requires an unsupervised solution, and yet for it to be tracked over time you need to approach it in a supervised way.
Problem: To overcome the above issue, you would need to (1) fit new data into an old set of topics, (2) construct a new topic model based on the new data, (3) monitor log likelihood values over time and devise a threshold for when to switch from the old model to the new one, and (4) merge the old and new solutions somehow so that the evolution of topics is revealed to a lay observer.
Recap of Problems
String cleaning for MALLET to consume the data is inefficient.
Feature selection requires manual work.
Selecting the optimal number of topics based on log likelihood does not account for what is practically sensible.
Computational cost makes it impractical to search a grid of the other parameters (beyond the number of topics).
Maintenance of topics over time poses challenging issues as you have to retain history but also reflect what is currently relevant.
If you've read this far, I'd like to thank you; this is a rather long post. If you are interested in the subject, feel free to either add more questions in the comments that you think are relevant, or offer your thoughts on how to overcome some of these problems.
Cheers
Thank you for this thorough summary!
As an alternative to topicmodels try the package mallet in R. It runs Mallet in a JVM directly from R and allows you to pull out results as R tables. I expect to release a new version soon, and compatibility with tm constructs is something others have requested.
To clarify, it's a good idea for documents to be at most around 1000 tokens long (not vocabulary). Any more and you start to lose useful information. The assumption of the model is that the position of a token within a given document doesn't tell you anything about that token's topic. That's rarely true for longer documents, so it helps to break them up.
Another point I would add is that documents that are too short can also be a problem. Tweets, for example, don't seem to provide enough contextual information about word co-occurrence, so the model often devolves into a one-topic-per-doc clustering algorithm. Combining multiple related short documents can make a big difference.
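A small illustration of both length adjustments; the thresholds are arbitrary, and in practice you would merge short documents by author, thread or time window rather than consecutively as done here:

```python
def split_long(tokens, max_len=1000):
    # break a long document into ~1000-token chunks
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]

def merge_short(docs, min_len=50):
    # glue consecutive short documents into larger pseudo-documents
    merged, buffer = [], []
    for tokens in docs:
        buffer.extend(tokens)
        if len(buffer) >= min_len:
            merged.append(buffer)
            buffer = []
    if buffer:
        merged.append(buffer)
    return merged

docs = [["just", "setting", "up", "my", "twttr"]] * 30
print(len(merge_short(docs)))   # 30 five-token tweets -> 3 pseudo-documents
```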
Vocabulary curation is in practice the most challenging part of a topic modeling workflow. Replacing selected multi-word terms with single tokens (for example by swapping spaces for underscores) before tokenizing is a very good idea. Stemming is almost never useful, at least for English. Automated methods can help vocabulary curation, but this step has a profound impact on results (much more than the number of topics) and I am reluctant to encourage people to fully trust any system.
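A quick sketch of that multi-word-term trick; the phrase list is something you would curate by hand or derive from a collocation detector:

```python
import re

PHRASES = ["new york", "topic model", "machine learning"]   # curated by hand

def join_phrases(text):
    # replace each phrase with a single underscore-joined token before tokenizing
    for phrase in PHRASES:
        text = re.sub(re.escape(phrase), phrase.replace(" ", "_"), text,
                      flags=re.IGNORECASE)
    return text

print(join_phrases("Machine learning and topic model research in New York"))
# -> "machine_learning and topic_model research in new_york"
```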
Parameters: I do not believe that there is a right number of topics. I recommend using a number of topics that provides the granularity that suits your application. Likelihood can often detect when you have too few topics, but after a threshold it doesn't provide much useful information. Using hyperparameter optimization makes models much less sensitive to this setting as well, which might reduce the number of parameters that you need to search over.
Topic drift: This is not a well-understood problem. More examples of real-world corpus change would be useful. Looking for changes in vocabulary (e.g. the proportion of out-of-vocabulary words) is a quick proxy for how well a model will fit.
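A tiny sketch of that out-of-vocabulary proxy; model_vocab stands in for the vocabulary of whatever model you trained:

```python
model_vocab = {"economy", "market", "growth", "inflation"}   # placeholder

def oov_proportion(docs, vocab):
    # share of tokens in the new batch that the trained model has never seen
    tokens = [tok for doc in docs for tok in doc.lower().split()]
    if not tokens:
        return 0.0
    return sum(tok not in vocab for tok in tokens) / len(tokens)

new_docs = ["inflation and crypto market turmoil", "AI growth outlook"]
print(oov_proportion(new_docs, model_vocab))   # high values suggest drift
```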
I am creating an App Engine application in Python that will need to perform efficient geospatial queries on datastore data. An example use case would be: I need to find the first 20 posts within a 10-mile radius of the current user. Having done some research into my options, I have found that the two best approaches for achieving this type of functionality currently seem to be:
Indexing geohashed geopoint data using Python's GeoModel library
Creating/deleting documents of structured data using Google's newer Search API
From a high-level perspective, it seems that indexing geohashes and performing queries on them directly would be less costly and much faster than having to create and delete a document for every geospatial query. However, I've also read that geohashing can be very inaccurate along the equator or along the 'fault lines' created by the hashing algorithm. I've seen very few posts contrasting the two methods in detail, and I think Stack is a good place to have this conversation, so my questions are as follows:
Has anyone implemented similar features and had positive experiences with either method?
Which method would be the cheaper alternative?
Which would be the faster alternative?
Is there another important method I'm leaving out?
Thanks in advance.
Geohashing does not have to be inaccurate at all; it's all in the implementation details. What I mean is that you can check the neighbouring geocells as well to handle border cases, making sure that includes neighbours on the other side of the equator.
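A sketch of that neighbour-cell trick using the python-geohash package for illustration; the precision and the fetch_by_prefix helper are assumptions about how your entities store their geohash:

```python
import geohash

def candidate_cells(lat, lon, precision=6):
    centre = geohash.encode(lat, lon, precision)
    # the centre cell plus its 8 neighbours covers the border cases
    return [centre] + geohash.neighbors(centre)

def nearby_posts(lat, lon, fetch_by_prefix):
    posts = []
    for cell in candidate_cells(lat, lon):
        # fetch_by_prefix(cell) stands in for a datastore query filtering on a
        # stored geohash property by that cell as a key prefix
        posts.extend(fetch_by_prefix(cell))
    return posts
```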
If your use case is finding other entities within a radius as you suggest, I would definitely recommend using the Search API. They have a distance function tailored for that use.
Search API queries are more expensive than Datastore queries, yes. But if you weigh in the computation time for doing these calculations in your instance, and probably having to iterate through all entities for each geohash to make sure the distance is actually less than the desired radius, then I would say the Search API is the winner. And don't forget about the implementation time.
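A hedged sketch of what the Search API query could look like on the first-generation Python runtime, assuming posts were indexed with a GeoField named location; the index name, field name and the ~16 km (10 mile) radius are illustrative:

```python
from google.appengine.api import search

def nearby_posts(lat, lon, radius_m=16093, limit=20):
    index = search.Index(name="posts")
    # the Search API's distance() function does the radius filtering for you
    query_string = "distance(location, geopoint(%f, %f)) < %d" % (lat, lon, radius_m)
    options = search.QueryOptions(limit=limit)
    return index.search(search.Query(query_string=query_string, options=options))

# Documents would have been added earlier with something like:
# search.Index(name="posts").put(search.Document(fields=[
#     search.GeoField(name="location", value=search.GeoPoint(lat, lon))]))
```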
You can have a look at this post, it can be another great alternative.
I have used this within my app and it works great for my requirement of finding my app users within a provided radius.
Is anyone aware of a KD-Tree, or similar spatial index, implemented in SQL? I was considering writing my own using Python and Django's ORM, but I'd like to avoid reinventing the wheel.
I have a table containing millions of rows, with each row containing 128 columns of image feature data. Given an arbitrary 128-element list of image features, I want to use a KD-tree to find the N most similar images in the database. I've found a lot of KD-tree implementations, but they all appear to work only in local memory and don't scale or talk to databases.
A KD-tree does not work well for high-dimensional data, and 128 dimensions would be quite high. The KD-tree indexes each dimension at a different level of the tree, and when performing a query the algorithm does a lot of back-tracking (searching both sides of a branch) and ends up examining most of the points in the tree. When this happens, the advantages of the tree structure disappear and an exhaustive comparison ends up running faster.
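A small in-memory experiment that illustrates the point; the sizes are kept modest so it finishes quickly, and it says nothing about the database-backed case you describe:

```python
import time

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.random((50_000, 128))
queries = rng.random((100, 128))

for algo in ("kd_tree", "brute"):
    nn = NearestNeighbors(n_neighbors=10, algorithm=algo).fit(X)
    start = time.perf_counter()
    nn.kneighbors(queries)
    # at 128 dimensions, the tree typically gains little (or loses) vs brute force
    print(algo, round(time.perf_counter() - start, 3), "seconds")
```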
You may want to find an existing image similarity search system that you can map your data into. Here is one called Lire which extracts features from images and indexes them using Lucene.
If your work is more research-oriented you may want to read up on metric space indexes and approximate k-nearest neighbor search.
I might be a little off here, but your best bet may be using the GiST / GIN indexes inside PostgreSQL.
I need a data structure for doing 2D range counting queries (i.e. how many points are in a given rectangle).
I think my best bet is a range tree (it can count in O(log^2 n), or even O(log n) after some optimizations). Does that sound like a good choice? Does anybody know of a Python implementation, or will I have to write one myself?
See scipy.spatial.KDTree for one implementation.
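If an approximate workaround is acceptable, here is a sketch of using scipy's cKDTree for rectangle counts by querying the circle that circumscribes the rectangle and then filtering exactly; a proper range tree would avoid the filtering step:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
points = rng.random((100_000, 2))
tree = cKDTree(points)

def count_in_rect(xmin, xmax, ymin, ymax):
    centre = np.array([(xmin + xmax) / 2, (ymin + ymax) / 2])
    radius = np.hypot(xmax - xmin, ymax - ymin) / 2   # circumscribing circle
    candidates = tree.query_ball_point(centre, radius)
    pts = points[candidates]
    inside = (pts[:, 0] >= xmin) & (pts[:, 0] <= xmax) & \
             (pts[:, 1] >= ymin) & (pts[:, 1] <= ymax)
    return int(inside.sum())

print(count_in_rect(0.2, 0.4, 0.2, 0.4))
```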
There's also a less generic (but occasionally more useful, particularly with regard to what you have in mind) implementation using shapelib's quadtree. See this blog and the corresponding package on PyPI.
There are probably other implementations, too, but those are the two that I've used...