I'm currently somewhat lost with my problem; hopefully I can describe it accurately enough.
What I'm trying to do:
I have a list of tiny text fragments (think Twitter) and want to categorize them into different topics.
To do so, I check for the occurrence of certain keywords in the text (keywords that may contain wildcards). The current approach of using Google's RE2 regular-expression engine is really slow, because we have to check every keyword until one hits: effectively a linear search.
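Roughly, the current approach looks like this (a simplified sketch using Python's built-in re module as a stand-in for RE2; the topic/keyword data is made up):

    import re

    # Hypothetical topic -> keyword patterns; wildcards end up as ".*"
    TOPIC_PATTERNS = {
        "sports": [re.compile(p) for p in (r"foot.*ball", r"olympic.*")],
        "tech":   [re.compile(p) for p in (r"iphone\d*", r"andro.*id")],
    }

    def categorize(fragment):
        """Check every keyword pattern until one hits: O(topics * keywords)."""
        topics = []
        for topic, patterns in TOPIC_PATTERNS.items():
            if any(p.search(fragment.lower()) for p in patterns):
                topics.append(topic)
        return topics

    print(categorize("new iphone14 leak"))   # -> ['tech']

Every fragment is checked against every pattern, which is exactly the linear scan I'd like to avoid.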
What am I looking for? Something like a PostgreSQL trigram (pg_trgm) GIN index where I can run the search in reverse: instead of going through all the keywords one by one, I'd like some faster way of searching.
Actually using PostgreSQL in this case is not an option, because the whole thing is meant to run as a kind of SaaS.
Related
Usually, search engine software builds inverted indexes to make searches faster. The basic format is:
word: <docnum, positions>, <docnum, positions>, <docnum, positions>, ...
Whenever a query is quoted, like "Harry Potter Movies", it means the word positions must match exactly; in within-k queries like hello /4 world it generally means: find the word world within a distance of 4 words to the left or right of the word hello. My question is: we could handle such queries by linearly scanning the postings and computing the distances between words, but if the collection is really large we can't realistically scan all the postings. So is there some other data structure or kind of optimisation that Lucene or Solr uses?
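To make the within-k part concrete, here is a toy positional index and proximity check in Python (purely illustrative; this is not how Lucene or Solr store their postings):

    from collections import defaultdict

    def build_positional_index(docs):
        """index[word][doc_id] -> list of word positions in that document."""
        index = defaultdict(lambda: defaultdict(list))
        for doc_id, text in enumerate(docs):
            for pos, word in enumerate(text.lower().split()):
                index[word][doc_id].append(pos)
        return index

    def within_k(index, w1, w2, k):
        """Doc ids where w1 and w2 occur within k word positions of each other."""
        hits = []
        for doc_id in index[w1].keys() & index[w2].keys():   # docs containing both
            p1, p2 = index[w1][doc_id], index[w2][doc_id]
            if any(abs(a - b) <= k for a in p1 for b in p2):
                hits.append(doc_id)
        return hits

    docs = ["hello big wide world", "hello from a completely different world view"]
    index = build_positional_index(docs)
    print(within_k(index, "hello", "world", 4))   # -> [0]; doc 1 is 5 words apart

The nested loop over p1 and p2 is the linear check I mean; a merge over the two sorted position lists would avoid it, but the postings still have to be visited.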
A first solution could be to scan only the first k postings for each word. Another could be to scan only the top documents (usually called a champion list, sorted by tf-idf or similar at indexing time), but then better documents can be missed. Both solutions share the same disadvantage: neither guarantees quality. Yet a Solr server gives good-quality results even on large collections. How?
The phrase query you are asking about here is actually really efficient to compute positions for, because you're asking for the documents where 'Harry' AND 'Potter' AND 'Movies' all occur.
Lucene is pretty smart, but the core of its algorithm for this is that it only needs to visit the positions lists of documents where all three of these terms even occur.
Lucene's postings are also sharded into multiple files:
Inside the counts-files are: (Document, TF, PositionsAddr)+
Inside the positions-files are: (PositionsArray)
So it can sweep across the (doc, tf, pos_addr) for each of these three terms, and only consult the PositionsArray when all three words occur in the specific document. Phrase queries have the opportunity to be really quick, because you only visit at most all the documents from the least-frequent term.
If you want to see a phrase query run slowly (and do lots of disk seeks!), try: "to be or not to be" ... here the AND part doesn't help much because all the terms are very common.
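To make the "AND first, positions later" idea concrete, here is a toy sketch in Python (it assumes a dictionary-of-position-lists index like the one sketched in the question; Lucene's real on-disk postings are far more elaborate):

    def phrase_match(index, phrase):
        """Doc ids containing the exact phrase.

        Doc lists are intersected starting from the least-frequent term,
        and position lists are consulted only for the surviving documents.
        """
        words = phrase.lower().split()
        rarest_first = sorted(set(words), key=lambda w: len(index[w]))
        docs = set(index[rarest_first[0]])
        for w in rarest_first[1:]:
            docs &= set(index[w])                 # the AND step
        hits = []
        for doc_id in docs:                       # positions only touched here
            positions = [set(index[w][doc_id]) for w in words]
            # The phrase occurs if some start offset lines up for every word.
            if any(all(start + offset in positions[offset]
                       for offset in range(len(words)))
                   for start in positions[0]):
                hits.append(doc_id)
        return sorted(hits)

For "to be or not to be" the docs set stays huge after the AND step, so many position lists get touched, which is exactly why that query is slow.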
I am working on a project where I need to search within all the documents (PDF, DOC, etc.) present in the database to find those relevant to a given query.
Earlier I used a simple relation that stored the relevant keywords associated with each document, and if the query contained those keywords I fetched those documents. But this method is not very reliable, as the keywords can be misleading. I need to search within the documents themselves, and I am looking for a practical search algorithm that scales well and has low time complexity.
Any suggestions and resources are most welcome.
Thank you.
Try the Rabin-Karp search algorithm (based on hash codes). Since you have to search for more than one pattern across many documents, it computes the hash codes of all the patterns up front and then looks for all of them in a single pass over each document.
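A minimal sketch of multi-pattern Rabin-Karp (assuming, as the classic formulation does, that all patterns share one length; patterns of different lengths would need one pass per length bucket):

    def rabin_karp_multi(text, patterns, base=256, mod=1_000_003):
        """Find all occurrences of several equal-length patterns in one pass.
        Returns {pattern: [start offsets]}."""
        m = len(next(iter(patterns)))
        assert all(len(p) == m for p in patterns)

        def h(s):
            value = 0
            for ch in s:
                value = (value * base + ord(ch)) % mod
            return value

        targets = {}
        for p in patterns:                        # hash every pattern up front
            targets.setdefault(h(p), []).append(p)

        hits = {p: [] for p in patterns}
        if len(text) < m:
            return hits
        high = pow(base, m - 1, mod)              # weight of the leading character
        rolling = h(text[:m])
        for i in range(len(text) - m + 1):
            if rolling in targets:                # hash hit: verify against collisions
                for p in targets[rolling]:
                    if text[i:i + m] == p:
                        hits[p].append(i)
            if i + m < len(text):                 # roll the window one char to the right
                rolling = ((rolling - ord(text[i]) * high) * base
                           + ord(text[i + m])) % mod
        return hits

    print(rabin_karp_multi("the cat sat on the mat", {"cat", "mat", "sat"}))
    # offsets: 'cat' -> [4], 'sat' -> [8], 'mat' -> [19]

Each document is scanned once regardless of how many patterns there are, which is the "all patterns in one go" property.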
Apologies for the vague nature of this question but I'm honestly not quite sure where to start and thought I'd ask here for guidance.
As an exercise, I've downloaded several academic papers and stored them as plain text in a MongoDB database.
I'd like to write a search feature (using Python, R, whatever) where you enter some text and it returns the most relevant articles. Clearly, "relevant" is really hard; that's what Google got so right.
However, I'm not looking for it to be perfect. Just to get something. A few thoughts I had were:
1) Simple MongoDB full text search
2) Implement Lucene Search
3) Tag them (unsure how, though) and then return them sorted by the number of matching tags?
Is there a solution someone has used that's out of the box and works fairly well? I can always optimize the search feature later -- for now I just want all the pieces to move together...
Thanks!
Is there a solution someone has used that's out of the box and works fairly well?
It depends on how you define "well", but in simple terms, I'd say no. There is just no single, precise definition of "fairly well". A lot of problem-specific challenges arise when one tries to implement a good search algorithm. Those challenges lie in:
diversity of user needs: users in different fields have different intentions and, as a result, different expectations of a search results page;
diversity of natural languages, if you are trying to implement multi-language search (German has a lot of noun compounds, Russian has enormous inflectional variability, etc.).
There are some algorithms that are proven to work better than others, though, and are therefore good starting points. TF*IDF and BM25 are the two most popular.
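To give a feel for what plain TF*IDF scoring does (BM25 adds document-length normalization and term-frequency saturation on top), a bare-bones sketch:

    import math
    from collections import Counter

    def tf_idf_scores(query, docs):
        """Score each document against the query with a plain TF*IDF sum."""
        tokenized = [doc.lower().split() for doc in docs]
        n = len(docs)
        df = Counter(word for doc in tokenized for word in set(doc))
        scores = []
        for doc in tokenized:
            tf = Counter(doc)
            score = sum(tf[w] * math.log(n / df[w])
                        for w in query.lower().split() if w in df)
            scores.append(score)
        return scores

    docs = ["deep learning for search", "search engines explained", "cooking pasta"]
    print(tf_idf_scores("search engines", docs))   # the second document scores highest

Rare terms ("engines") weigh more than common ones ("search"), which is the whole intuition behind the weighting.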
I can always optimize the search feature later -- for now I just want all the pieces to move together...
MongoDB or any RDBMS with full-text indexing support is good enough for a proof of concept, but if you need to optimize for search performance, you will need an inverted index (Solr/Lucene). From Solr/Lucene you get the ability to control:
how exactly words are stemmed (this is important for solving understemming/overstemming problems);
what counts as a word: is "supercomputer" one word? What about "stackoverflow" or "OutOfBoundsException"?
synonyms and word expansion (should "O2" be found for an "oxygen" query? a small sketch follows below);
how exactly the search is performed: which words can be ignored during search, which ones are required to be found, and which ones are required to be found near each other (think of search phrases like "not annealed" or "without expansion").
This is just what comes to mind first.
So if you are planning to work these things out, I definitely recommend Lucene as a framework, or Solr/Elasticsearch as a search system if you need to build a proof of concept fast. If not, MongoDB or an RDBMS will work well.
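As a tiny illustration of the synonym/word-expansion point above, a query-time expansion sketch (the synonym map is made up; real engines handle this in their analyzer configuration at index or query time):

    # Hypothetical, hand-maintained synonym map.
    SYNONYMS = {"o2": {"oxygen"}, "oxygen": {"o2"}}

    def expand_query(query):
        """Expand each query term with its synonyms; each set is OR'ed at match time."""
        return [{term} | SYNONYMS.get(term, set())
                for term in query.lower().split()]

    print(expand_query("oxygen sensor"))   # -> [{'oxygen', 'o2'}, {'sensor'}] (order may vary)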
I'm building an email filter and I need a way to efficiently match a single email to a large number of filters/rules. The email can be matched on any of the following fields:
From name
From address
Sender name
Sender address
Subject
Message body
Presently there are over 5000 filters (and growing) which are all defined in a single table in our PostgreSQL (9.1) database. Each filter may have 1 or more of the above fields populated with a Python regular expression.
The way filtering is currently being done is to select all filters and load them into memory. We then iterate over them for each email until a positive match is found on all non-blank fields. Unfortunately this means for any one email there can potentially be as many as 30,000 (5000 x 6) re.match operations. Clearly this won't scale as more filters get added (actually it already doesn't).
Is there a better way to do this?
Options I've considered so far:
Converting the saved Python regular expressions to POSIX-style ones to make use of PostgreSQL's SIMILAR TO expression. Will this really be any quicker? It seems to me like it's simply shifting the load somewhere else.
Defining filters on a per user basis. Though this isn't really practical because with our system users actually benefit from a wealth of predefined filters.
Switching to a document-based search engine like Elasticsearch, where the first email to be filtered is saved as the canonical representation. By finding similar emails we can then narrow down to a specific feature set to test against and get a positive match.
Switching to a Bayesian filter, which would also give us some machine-learning capability to detect similar emails, or changes to existing emails that would still match with a high enough probability to guess they were the same thing. This sounds cool, but I'm not sure it would scale particularly well either.
Are there other options or approaches to consider?
The trigram support in PostgreSQL version 9.1 might give you what you want.
http://www.postgresql.org/docs/9.1/interactive/pgtrgm.html
It almost certainly will be a viable solution in 9.2 (scheduled for release in summer of 2012), since the new version knows how to use a trigram index for fast matching against regular expressions. At our shop we have found the speed of trigram indexes to be very good.
Also, if you ever want to do a "nearest neighbor" search, where you find the K best matches based on similarity to a search argument, a trigram index is wonderful -- it actually returns rows from the index scan in order of "distance". Search for KNN-GiST for write-ups.
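Roughly what that looks like, issued here through psycopg2 (the table and column names are made up; the <-> distance operator uses a GiST trigram index, which is what enables the KNN-GiST behaviour mentioned above):

    import psycopg2

    conn = psycopg2.connect("dbname=mail")            # hypothetical connection string
    with conn, conn.cursor() as cur:
        # One-time setup: enable the extension and build a trigram index.
        cur.execute("CREATE EXTENSION IF NOT EXISTS pg_trgm;")
        cur.execute("CREATE INDEX subject_trgm_idx "
                    "ON emails USING gist (subject gist_trgm_ops);")
        # K nearest neighbours by trigram distance: the index scan itself
        # returns rows ordered by distance to the search argument.
        cur.execute("SELECT id, subject, subject <-> %s AS distance "
                    "FROM emails ORDER BY subject <-> %s LIMIT 10;",
                    ("order confirmation", "order confirmation"))
        for row in cur.fetchall():
            print(row)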
How complex are these regexps? If they really are regular expressions (without all the crazy Python extensions), then you can combine them all into a single regexp (as alternatives) and then use a simple (i.e. in-memory) regexp matcher.
I am not sure this will work, but I suspect you will be pleasantly surprised to find that the combined regexp compiles to a significantly smaller state machine, because common states will be merged.
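A minimal sketch of the single-alternation idea with Python's built-in re module (the filter patterns are made up; each branch gets a named group so you can tell which filter hit):

    import re

    # Hypothetical filter patterns keyed by filter id.
    FILTERS = {"f1": r"invoice\s+\d+", "f2": r"(?:urgent|asap)", "f3": r"unsubscribe"}

    # One big alternation; each branch is wrapped in a named group.
    combined = re.compile("|".join("(?P<%s>%s)" % (fid, pat)
                                   for fid, pat in FILTERS.items()),
                          re.IGNORECASE)

    def matching_filters(text):
        """Return the ids of all filters that match anywhere in the text."""
        return {m.lastgroup for m in combined.finditer(text)}

    print(matching_filters("URGENT: invoice 42 attached"))   # -> {'f1', 'f2'} (order may vary)

Whether this actually pays off depends on how well the engine folds the common parts of the alternation together when it compiles the pattern.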
Also, for a fast regular-expression engine, consider nrgrep, which scans very quickly. This should give you a speedup when jumping from one header to the next (I haven't used it myself, but the authors are friends of friends, and the paper looked pretty neat when I read it).
I want to show related [things/questions] in my app, similar to what Stack Overflow does when you tab out of the Title field.
I can think of only one way to do it, which I think might be fast enough:
Do a search for the title in the corpus of titles of all [things] and return the first x matches. We can use whatever search is already being used for site search.
What other ways are there to do this that are fast enough? Since this happens on tab-out, heavy server-side processing is not feasible.
I am just looking for a way to do this, but I am using MySQL and Django, so if your answer uses them, all the better.
[I cannot think of good tags for it, so please feel free to edit them]
You're looking at a content-based recommendation algorithm. AFAICT StackOverflow's looks at the tags and the words in the title, and finds questions that share some of these. It can be implemented as a nearest neighbour search in a space where documents are represented as TF-IDF vectors.
Implementation-wise, go with any Django search engine that supports stemming, stopwords, non-strict matches, and tf-idf weights. Algorithmic complexity isn't high (just a few index lookups), so it doesn't matter if it's written in Python.
If you can't find a search engine that does all of this, leave the stemming and stopwords to the search engine, query it on individual words, and do your own tf-idf scoring, with a score that favours items sharing similar tags.
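If you end up rolling your own scoring, here is a compact sketch of the TF-IDF nearest-neighbour idea using scikit-learn (assumed to be available; the titles are made up, and tag-based boosting is left out):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    titles = [
        "How do I join two tables in MySQL?",
        "Django ORM join across foreign keys",
        "Fastest way to parse JSON in Python",
    ]

    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(titles)        # one tf-idf vector per title

    def related(new_title, top_n=2):
        """Return the existing titles most similar to the one being typed."""
        scores = cosine_similarity(vectorizer.transform([new_title]), matrix)[0]
        best = scores.argsort()[::-1][:top_n]
        return [(titles[i], round(float(scores[i]), 3)) for i in best]

    print(related("joining tables with the Django ORM"))

Note that without stemming "joining" does not match "join", which is why letting a real search engine handle stemming and stopwords, as suggested above, is worth it.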