Python interval based sparse container

I am trying to create an interface between structured data and NLTK. NLP libraries generally work with bags of words, hence I need to turn my structured data into bags of words.
I need to associate the offset of a word with its meta-data. Therefore my best bet is to have some sort of container that holds ranges as keys (allowing nested ranges) and can retrieve all the meta-data (multiple entries if the word offset is part of a nested range).
What code can I pick up that would do this efficiently (i.e., a sparse representation of the data)? Efficiency matters because my global corpus will be at least a few hundred megabytes.
Note:
I am serialising structured forum posts, which will include posts containing sections of quotes. I want to know which topic a word belonged to, and whether it is part of a quote or of user text. There will probably be additional metadata as my work progresses. Note that a word belonging to a quote is what I meant by nested meta-data: the word is part of a quote, which belongs to a post made by a user.
I know that one can tag words in NLTK. I haven't looked into it; if it's possible to do what I want that way, please comment. But I am still looking for the original approach.
There is probably something in numpy that can solve my problem; I am looking at that now.
Edit:
The input data is far too complex to rip out and post. I have found what I was looking for, though: http://packages.python.org/PyICL/. I needed to talk about intervals and not ranges :D I have used boost extensively, but making that a dependency makes me a bit uneasy (sadly, I am having compiler errors with PyICL :( ).
The question now is: does anyone know of an interval container library or data structure that can be used to index nested intervals in a sparse fashion? Or, put differently, one that provides semantics similar to boost.icl?

If you don't want to use PyICL or boost.icl, then instead of relying on a specialized library you could just use sqlite3 to do the job. An in-memory database will still be a few orders of magnitude slower than boost.icl (from my experience coding other data structures against sqlite3), but it should be more effective than building a C++ std::vector-style approach on top of Python containers.
You can store each interval as two integers and put a date_type_low < offset < date_type_high predicate in your WHERE clause. Depending on your table structure, this will return nested/overlapping ranges.
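Here is a rough sketch of that idea, assuming an in-memory database and made-up table and column names; it is only meant to show the query shape, not a finished design.

import sqlite3

# in-memory database, as suggested above
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE spans (
                    low  INTEGER,   -- first word offset covered
                    high INTEGER,   -- one past the last word offset covered
                    meta TEXT)""")
conn.execute("CREATE INDEX spans_low_high ON spans (low, high)")

# a post spanning offsets 0-120 that contains a quote at offsets 40-60
conn.executemany("INSERT INTO spans VALUES (?, ?, ?)",
                 [(0, 120, "post by user A, topic X"),
                  (40, 60, "quote of user B")])

def metadata_at(offset):
    """All metadata whose interval contains this word offset;
    nested intervals simply all match."""
    rows = conn.execute("SELECT meta FROM spans WHERE low <= ? AND ? < high",
                        (offset, offset))
    return [meta for (meta,) in rows]

print(metadata_at(45))   # both the post and the quote metadata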

Related

How to translate lists into dictionaries or numpy arrays preserving structural information and improving efficiency?

I am building a database of concepts and neighbour concepts from a certain set of text files (arbitral awards). So far, I use lists, as I believe (not an expert here) they are the computationally simplest and most efficient way to store and retrieve the information I'm using. The structure of those lists is:
memory = [
    [tag1, number1,
        [
            [tag11, number11],
            [tag12, number12],
            ...
        ]
    ],
    ...
]
When fed a string (usually the length of a paragraph or a single page), my script will look for each tag and then for the sub-tags of every tag. Basically:
for tag in memory:
    if tag[0] in text:
        tag[1] += 1
    else:
        for subtag in tag[2]:
            if subtag[0] in text:
                subtag[1] += 1
                tag[1] += 1
There are some extra rules to break the search and to avoid repetitions, but you can imagine they don't solve the 'for in for in for' loop problem! (accounting for the 'in text' part)
My purpose is to build a semantic structure by enumeration, by relating tags with sub tags. For example: 'Greetings', with: 'Hi', 'Hello', 'Good Morning', and so on. The numbers say how often a tag is found in the text files I use and the sub-tags numbers say how often a certain sub tag is related with its parent tag.
My problem is that, after a few months using this structure, I have accumulated enough tags and sub-tags that my script runs slowly every time it has to search for every tag and every sub-tag within a given string, and I'm looking for options to solve this problem.
One option I have is to migrate to numpy arrays, but I wouldn't know where to start to be sure I will gain efficiency and not just translate my problem into a fancier structure. I am familiar with matrix multiplication and tensor products in numpy, which seem applicable here with some sort of convolutional algorithm (just guessing from what I've done before), but I'm not sure they work as well with strings as I've seen them work with numbers, because I would be multiplying small matrices (the tag list) with large matrices (the strings), and I have been told that numpy is more useful for large-by-large multiplications.
The other option is to use dictionaries, at least as an intermediate step. I sincerely don't see how they would make things go faster, but they were suggested by an engineer (I'm a lawyer so... no idea). The problem with the latter is how to keep track of occurrences of keys. I can see how I can translate my list into dictionaries, but even though I can think of ways to translate and update the occurrence information, I just feel it won't be more efficient, as the same for loops would have to be executed.
One option would be to set the occurrences as the first value of every key... but again, I'm not sure how it would make my script go faster.
I get that the increasing amount of computation won't just go away, but as an amateur-intermediate enthusiast of programming, I understand that my method can be sensibly improved by going further than simple for loops.
Therefore, as a non expert, I would appreciate some information, guidance or just references about how to translate my list into dictionaries or numpy arrays which at the same time help me to loop over tags and sub-tags faster than the basic for loop.
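For concreteness, a minimal sketch of the dictionary translation described above: each tag maps to its count plus a sub-dict of sub-tag counts (the tag names are placeholders). The scan itself is the same loop as before, so this mainly changes the bookkeeping, not the cost of searching the text.

memory = {
    "greetings": {
        "count": 0,
        "subtags": {"hi": 0, "hello": 0, "good morning": 0},
    },
}

def update_counts(text):
    text = text.lower()
    for tag, entry in memory.items():
        if tag in text:
            entry["count"] += 1
        else:
            for subtag in entry["subtags"]:
                if subtag in text:
                    entry["subtags"][subtag] += 1
                    entry["count"] += 1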

How to create a nested data structure in Python?

Since I recently started a new project, I'm stuck in the "think before you code" phase. I've always done basic coding, but I really think I now need to carefully plan how I should organize the results that are produced by my script.
It's essentially quite simple: I have a bunch of satellite data I'm extracting from Google Earth Engine, including different sensors, different acquisition modes, etc. What I would like to do is to loop through a list of "sensor-acquisition_mode" couples, request the data, do some more processing, and finally save it to a variable or file.
Suppose I have the following example:
sensors = ['landsat','sentinel1']
sentinel_modes = ['ASCENDING','DESCENDING']
sentinel_polarization = ['VV','VH']
In the end, I would like to have some sort of nested data structure that at the highest level has the elements 'landsat' and 'sentinel1'; under 'landsat' I would have a time and values matrix; under 'sentinel1' I would have the different modes and then as well the data matrices.
I've been thinking about lists, dictionaries or classes with attributes, but I really can't make up my mind, especially since I don't have that much experience.
At this stage, a little help in the right direction would be much appreciated!
Lists: Don't use lists for nested and complex data structures. You're just shooting yourself in the foot: code you write will be specialized to the exact format you are using, and any changes or additions will be brutal to implement.
Dictionaries: Aren't bad. They'll nest nicely, and you can use a dictionary whose value is a dictionary to hold named info about the keys. This is probably the easiest choice.
Classes: Classes are really useful for this if you need a lot of behavior to go with the data: you want the string representation to look a certain way, you want to be able to use primitive operators for some functionality, or you just want to make the code slightly more readable or reusable.
From there, it's all your choice. If you want to go through the extra code (it's good for you) of writing them as classes, do it! Otherwise, dictionaries will get you where you need to go. Notably, the only thing a dictionary couldn't handle would be two things that should sit at the same key level with the same name (dicts don't do repetition).
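For the example above, a nested dictionary could look something like this; the keys come from the question, and the empty arrays are just placeholders for the real time/value matrices.

import numpy as np

results = {
    'landsat': {
        'time': np.array([]),      # acquisition times
        'values': np.array([]),    # corresponding measurements
    },
    'sentinel1': {
        'ASCENDING':  {'VV': np.array([]), 'VH': np.array([])},
        'DESCENDING': {'VV': np.array([]), 'VH': np.array([])},
    },
}

# Access then reads naturally, e.g. results['sentinel1']['ASCENDING']['VV']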

Tree of trees? Table of trees? What kind of data structure have I created?

I am creating a python module that creates and operates on data structures to store lots of semantically tagged data and metadata from real experiments. So in an experiment you have:
subjects
treatments
replicates
Enclosing these 3 categories is the experiment, and combinations of the three categories are what I am calling "units". Now there is no inherently correct hierarchy among the three (they are table-like), but for certain analyses it is useful to think of a certain permutation of the three as a hierarchy,
e.g. (subjects-->(treatments-->(replicates)))
or
(replicates-->(treatments-->(subjects)))
Moreover, when collecting data, files will be copy-pasted into a folder on a desktop, so data is at least coming in as a tree. I have thought a lot about which hierarchy is "better", but I keep coming up with use cases for most of the 6 possible permutations. I want my module to be flexible, in that the user can think of the experiment or collect the data using whatever hierarchy, table, or hierarchy-table hybrid makes sense to them.
Also, the "units" (or table entries) are containers for arbitrary amounts of data (bytes to gigabytes, whatever, ideally) of any organizational complexity. This is why I didn't think a relational database approach was really the way to go, and a NoSQL-type solution makes more sense. But then I have the problem of how to order the three categories if none is "correct".
So my question is what is this multifaceted data structure?
Does some sort of fluid data structure or set of algorithms exist to easily inter-convert or produce structured views?
The short answer is that HDF5 addresses these fairly common concerns and I would suggest it. http://www.hdfgroup.org/HDF5/
In python: http://docs.h5py.org/en/latest/high/group.html
http://odo.pydata.org/en/latest/hdf5.html
will help.
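A rough illustration of how the "units" could map onto HDF5 groups with h5py: the hierarchy shown is just one of the six permutations, and all names here are made up. Different orderings can be exposed as extra groups of hard links to the same datasets, so no single permutation has to be canonical.

import h5py
import numpy as np

with h5py.File("experiment.h5", "w") as f:
    # one "unit" = one group; intermediate groups are created automatically
    unit = f.create_group("subject_01/treatment_A/replicate_1")
    unit.create_dataset("raw", data=np.zeros((100, 3)))
    unit.attrs["operator"] = "someone"      # arbitrary per-unit metadata

    # an alternative view: hard links to the same data under another ordering
    view = f.create_group("by_replicate/replicate_1/treatment_A")
    view["subject_01"] = f["subject_01/treatment_A/replicate_1"]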

Data structure options for efficiently storing sets of integer pairs on disk?

I have a bunch of code that deals with document clustering. One step involves calculating the similarity (for some unimportant definition of "similar") of every document to every other document in a given corpus, and storing the similarities for later use. The similarities are bucketed, and I don't care what the specific similarity is for purposes of my analysis, just what bucket it's in. For example, if documents 15378 and 3278 are 52% similar, the ordered pair (3278, 15378) gets stored in the [0.5,0.6) bucket. Documents sometimes get added to or removed from the corpus after initial analysis, so corresponding pairs get added to or removed from the buckets as needed.
I'm looking at strategies for storing these lists of ID pairs. We found a SQL database (where most of our other data for this project lives) to be too slow and too large disk-space-wise for our purposes, so at the moment we store each bucket as a compressed list of integers on disk (originally zlib-compressed, but now using lz4 instead for speed). Things I like about this:
Reading and writing are both quite fast
After-the-fact additions to the corpus are fairly straightforward to add (a bit less so for lz4 than for zlib because lz4 doesn't have a framing mechanism built in, but doable)
At both write and read time, data can be streamed so it doesn't need to be held in memory all at once, which would be prohibitive given the size of our corpora
Things that kind of suck:
Deletes are a huge pain, and basically involve streaming through all the buckets and writing out new ones that omit any pairs that contain the ID of a document that's been deleted
I suspect I could still do better both in terms of speed and compactness with a more special-purpose data structure and/or compression strategy
So: what kinds of data structures should I be looking at? I suspect that the right answer is some kind of exotic succinct data structure, but this isn't a space I know very well. Also, if it matters: all of the document IDs are unsigned 32-bit ints, and the current code that handles this data is written in C, as Python extensions, so that's probably the general technology family we'll stick with if possible.
How about using one hash table or B-tree per bucket?
On-disk hashtables are standard. Maybe the BerkeleyDB libraries (available in stock Python) will work for you; but be advised that since they come with transactions they can be slow, and may require some tuning. There are a number of choices, such as gdbm and tdb, that you should all give a try. Just make sure you check out the API and initialize them with an appropriate size. Some will not resize automatically, and if you feed them too much data their performance just drops a lot.
Anyway, you may want to use something even more low-level, without transactions, if you have a lot of changes.
A pair of ints is a long, and most databases should accept a long as a key; in fact, many will accept arbitrary byte sequences as keys.
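A minimal sketch of that packing, using Python 3's stock dbm module as a stand-in for BerkeleyDB/gdbm/tdb (the bucket file name is made up): two unsigned 32-bit document ids become one fixed 8-byte key, and the presence of the key is all that needs to be stored.

import dbm
import struct

def pair_key(a, b):
    # two unsigned 32-bit ints -> one 8-byte big-endian key
    return struct.pack(">II", a, b)

with dbm.open("bucket_0.5-0.6", "c") as bucket:
    bucket[pair_key(3278, 15378)] = b""            # membership is the payload
    print(pair_key(3278, 15378) in bucket)         # fast existence check
    del bucket[pair_key(3278, 15378)]              # per-pair deletes are cheap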
Why not just store a table containing stuff that was deleted since the last re-write?
This table could be the same structure as your main bucket, maybe with a Bloom filter for quick membership checks.
You can re-write the main bucket data without the deleted items either when you were going to re-write it anyway for some other modification, or when the ratio of deleted items:bucket size exceeds some threshold.
This scheme could work either by storing each deleted pair alongside each bucket, or by storing a single table for all deleted documents: I'm not sure which is a better fit for your requirements.
With a single table, it's hard to know when you can remove an item unless you know how many buckets it affects, short of re-writing all buckets whenever the deletion table gets too large. This could work, but it's a bit stop-the-world.
You also have to do two checks for each pair you stream in (i.e., for (3278, 15378), you'd check whether either 3278 or 15378 has been deleted, instead of just checking whether the pair (3278, 15378) has been deleted).
Conversely, the per-bucket table of each deleted pair would take longer to build, but be slightly faster to check, and easier to collapse when re-writing the bucket.
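A small sketch of the single-table variant of this scheme, assuming each bucket can be streamed as (doc_a, doc_b) pairs; all names here are illustrative.

deleted_docs = set()        # document ids removed since the last rewrite

def mark_deleted(doc_id):
    deleted_docs.add(doc_id)

def live_pairs(bucket_pairs):
    """Stream pairs, skipping any that reference a deleted document."""
    for a, b in bucket_pairs:
        if a in deleted_docs or b in deleted_docs:
            continue
        yield (a, b)

def rewrite_bucket(read_pairs, write_pair):
    """Compact a bucket once the deleted:total ratio crosses a threshold,
    or piggy-back on a rewrite that was happening anyway."""
    for pair in live_pairs(read_pairs):
        write_pair(pair)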
You are trying to reinvent what already exists in new age NoSQL data stores.
There are 2 very good candidates for your requirements.
Redis
MongoDB
Both support data structures like dictionaries, lists and queues. Operations like append, modify or delete are also available in both, and are very fast.
The performance of both of them is driven by amount of data that can reside in the RAM.
Since most of your data is integer based, that should not be a problem.
My personal suggestion is to go with Redis, with a good persistence configuration (i.e. the data should periodically be saved from RAM to disk).
Here is a brief overview of Redis data structures:
http://redis.io/topics/data-types-intro
The Redis database is a lightweight installation, and a Python client is available.
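For instance, one Redis set per similarity bucket would cover the add/check/delete operations; this is just a sketch assuming the redis-py client and a local server, with packed (doc_a, doc_b) pairs as set members.

import struct
import redis

r = redis.Redis()                       # default localhost:6379

def pair_member(a, b):
    return struct.pack(">II", a, b)     # two unsigned 32-bit ids -> 8 bytes

bucket = "sim:0.5-0.6"
r.sadd(bucket, pair_member(3278, 15378))                # add a pair
print(r.sismember(bucket, pair_member(3278, 15378)))    # membership check
r.srem(bucket, pair_member(3278, 15378))                # delete a single pair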

How do I build an efficient email filter against large ruleset (5000+ and growing)

I'm building an email filter and I need a way to efficiently match a single email to a large number of filters/rules. The email can be matched on any of the following fields:
From name
From address
Sender name
Sender address
Subject
Message body
Presently there are over 5000 filters (and growing) which are all defined in a single table in our PostgreSQL (9.1) database. Each filter may have 1 or more of the above fields populated with a Python regular expression.
The way filtering is currently being done is to select all filters and load them into memory. We then iterate over them for each email until a positive match is found on all non-blank fields. Unfortunately this means for any one email there can potentially be as many as 30,000 (5000 x 6) re.match operations. Clearly this won't scale as more filters get added (actually it already doesn't).
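For reference, a rough sketch of the current scheme as described: each rule is a set of compiled regexes keyed by field, and a rule fires only if all of its non-blank fields match. The email attribute names are assumptions.

def matching_rules(email, rules):
    """rules: list of dicts mapping a field name (e.g. 'subject', 'body')
    to a compiled regex; blank fields are simply absent."""
    hits = []
    for rule in rules:                                  # ~5000 iterations
        if all(rx.match(getattr(email, field) or "")    # up to 6 matches each
               for field, rx in rule.items()):
            hits.append(rule)
    return hits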
Is there a better way to do this?
Options I've considered so far:
Converting saved python regular expressions to POSIX style ones to make use of PostgreSQL's SIMILAR TO expression. Will this really be any quicker? Seems to me like it's simply shifting the load somewhere else.
Defining filters on a per user basis. Though this isn't really practical because with our system users actually benefit from a wealth of predefined filters.
Switching to a document-based search engine like elastic search where the first email to be filtered is saved as the canonical representation. By finding similar emails we can then narrow down to a specific feature set to test on and get a positive match.
Switching to a bayes filter which would also give us some machine learning capability to detect similar emails or changes to existing emails that would still match with a high enough probability to guess that they were the same thing. This sounds cool but I'm not sure it would scale particularly well either.
Are there other options or approaches to consider?
The trigram support in PostgreSQL version 9.1 might give you what you want.
http://www.postgresql.org/docs/9.1/interactive/pgtrgm.html
It almost certainly will be a viable solution in 9.2 (scheduled for release in summer of 2012), since the new version knows how to use a trigram index for fast matching against regular expressions. At our shop we have found the speed of trigram indexes to be very good.
Also, if you ever want to do a "nearest neighbor" search, where you find the K best matches based on similarity to a search argument, a trigram index is wonderful -- it actually returns rows from the index scan in order of "distance". Search for KNN-GiST for write-ups.
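One way to exploit this from Python, assuming each filter also stores a literal phrase to key on (an assumption beyond what the table holds today), is to let a trigram index narrow the candidate set before running the full regexes; the table and column names below are made up.

import psycopg2

conn = psycopg2.connect("dbname=mail")
cur = conn.cursor()

# one-time setup (needs appropriate privileges):
#   CREATE EXTENSION IF NOT EXISTS pg_trgm;
#   CREATE INDEX filters_phrase_trgm ON filters USING gin (phrase gin_trgm_ops);

def candidate_filter_ids(subject, threshold=0.3):
    """Ids of filters whose stored phrase is trigram-similar to the incoming
    subject; only these then need the full Python regex check."""
    cur.execute(
        "SELECT id FROM filters WHERE similarity(phrase, %s) > %s",
        (subject, threshold),
    )
    return [row[0] for row in cur.fetchall()]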
How complex are these regexps? If they really are regular expressions (without all the crazy Python extensions) then you can combine them all into a single regexp (as alternatives) and then use a simple (i.e. in-memory) regexp matcher.
I am not sure this will work, but I suspect that you will be pleasantly surprised that the regexp compiles to a significantly smaller state machine, because it will merge common states.
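A hedged sketch of that idea: each rule's pattern becomes a named group, so one scan over a field reports every rule that matched. The rule names and patterns are placeholders; this only works for plain regexes (no conflicting flags or numbered backreferences), and older Python versions capped the number of groups, so a very large rule set may need to be compiled in chunks.

import re

rules = {
    "r1": r"viagra|cialis",
    "r2": r"free\s+money",
    "r3": r"^URGENT",
}

combined = re.compile(
    "|".join("(?P<%s>%s)" % (name, pat) for name, pat in rules.items()),
    re.IGNORECASE | re.MULTILINE,
)

def matched_rules(text):
    # lastgroup names the alternative that produced each match
    return {m.lastgroup for m in combined.finditer(text)}

print(matched_rules("URGENT: claim your free   money now"))   # rules r2 and r3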
Also, for a fast regular expression engine, consider using nrgrep, which does fast scanning. This should give you a speedup when jumping from one header to the next (I haven't used it myself, but the authors are friends of friends and the paper looked pretty neat when I read it).
