How do you efficiently do bulk index lookups? - python

I have these entity kinds:
Molecule
Atom
MoleculeAtom
Given a list(molecule_ids) whose length is in the hundreds, I need to get a dict of the form {molecule_id: list(atom_ids)}. Likewise, given a list(atom_ids) whose length is in the hundreds, I need to get a dict of the form {atom_id: list(molecule_ids)}.
Both of these bulk lookups need to happen really fast. Right now I'm doing something like:
atom_ids_by_molecule_id = {}
for molecule_id in molecule_ids:
    moleculeatoms = MoleculeAtom.all().filter(
        'molecule =', db.Key.from_path('molecule', molecule_id)).fetch(1000)
    atom_ids_by_molecule_id[molecule_id] = [
        MoleculeAtom.atom.get_value_for_datastore(ma).id() for ma in moleculeatoms
    ]
Like I said, len(molecule_ids) is in the hundreds. I need to do this kind of bulk index lookup on almost every single request, and I need it to be FAST, and right now it's too slow.
Ideas:
Will using a Molecule.atoms ListProperty do what I need? Consider that I am storing additional data on the MoleculeAtom node, and remember it's equally important for me to do the lookup in the molecule->atom and atom->molecule directions.
Caching? I tried memcaching lists of atom IDs keyed by molecule ID, but I have tons of atoms and molecules, and the cache can't fit it.
How about denormalizing the data by creating a new entity kind whose key name is a molecule ID and whose value is a list of atom IDs? The idea is, calling db.get on 500 keys is probably faster than looping through 500 fetches with filters, right?

Your third approach (denormalizing the data) is, generally speaking, the right one. In particular, db.get by keys is indeed about as fast as the datastore gets.
Of course, you'll need to denormalize the other way around too (an entity whose key name is an atom ID and whose value is a list of molecule IDs), and you will need to update everything carefully when atoms or molecules are altered, added, or deleted. If you need that to be transactional (multiple such modifications potentially in play at the same time), you need to arrange ancestor relationships, but I don't see how to do that for both molecules and atoms at the same time, so that could be a problem. Maybe, if modifications are rare enough (and depending on other aspects of your application), you could serialize the modifications in queued tasks.
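As a rough illustration only, here is a minimal sketch of that denormalization for the molecule->atom direction, assuming the old google.appengine.ext.db API; the kind name MoleculeAtomIndex and the atom_ids property are hypothetical names, not anything from your model:

from google.appengine.ext import db

class MoleculeAtomIndex(db.Model):
    # key_name is the molecule ID (as a string); the value is its list of atom IDs
    atom_ids = db.ListProperty(int)

def atom_ids_by_molecule(molecule_ids):
    keys = [db.Key.from_path('MoleculeAtomIndex', str(mid)) for mid in molecule_ids]
    entities = db.get(keys)  # one batched round trip instead of one query per molecule
    return dict((mid, entity.atom_ids if entity is not None else [])
                for mid, entity in zip(molecule_ids, entities))

A mirror kind keyed by atom ID would serve the atom->molecule direction the same way.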

What's the difference between index and internal ID in neo4j?

I'm setting up my database and sometimes I'll need to use an ID. At first, I added an ID as a property to my nodes of interest, but realized I could also just use neo4j's internal id. Then I stumbled upon CREATE INDEX ON :label(something) and was wondering exactly what this would do. I thought an index and the internal id would be the same thing?
This might be a stupid question, but since I'm kind of a beginner in databases, I may be missing some of these concepts.
Also, I've been reading about which kind of database to use (MySQL, MongoDB or neo4j) and decided on neo4j since my data pretty much follows a graph structure (it will be used to build metabolic models: connections genes->proteins->reactions->compounds).
In SQL the syntax just seemed too complex as I had to go around several tables to make simple connections that neo4j accomplishes quite easily...
From what I understand, MongoDB stores data independently, and, since my data is connected, it doesn't really seem to fit the data structure.
But again, since my knowledge on this subject is limited, perhaps I'm not making the right choice?
Thanks in advance.
Graph dbs are ideal for connected data like this; they are a more natural fit for both storing and querying than relational dbs or document stores.
As far as indexes and ids, here's the index section of the docs, but the gist of it is that this has to do with how Neo4j can look up starting nodes. Neo4j only uses indexes for finding these starting nodes (though in 3.5 when we do index lookup like this, if you have ORDER BY on the indexed property, it will use the index to augment the performance of the ordering).
Here is what Neo4j will attempt to use, depending on availability, from fastest to slowest:
Lookup by internal ID - This is always quick, however we don't recommend preserving these internal ids outside the context of a query. The reason for that is that when graph elements are deleted, their ids become eligible for reuse. If you preserve the internal ids outside of Neo4j, and perform a lookup with them later, there is a chance that whatever you expected it to reference could have been deleted, and may point at nothing, or may point at some new node with completely different data.
Lookup by index - This is where you would want to use CREATE INDEX ON (or add a unique constraint, if that makes sense for your model). When you use a MATCH or MERGE with the label and property (or properties) associated with the index, this is a fast and direct lookup of the node(s) you want.
Lookup by label scan - If you perform a MATCH with a label present in the pattern, but no means to use an index (either no index present for the label/property combination, or only a label is present but no property), then a label scan will be performed, and every node of the given label will be matched to and filtered. This becomes more expensive as more nodes with those labels are added.
All nodes scan - If you do not supply any label in your MATCH pattern, then every node in your db will be scanned and filtered. This is very expensive as your db grows.
You can EXPLAIN or PROFILE a query to see its query plan, which will show you which means of lookup are used to find the starting nodes, and the rest of the operations for executing the query.
Once a starting node or nodes are found, then Neo4j uses relationship traversal and filtering to expand and find all paths matching your desired pattern.
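For illustration only, here is a small sketch of creating an index and profiling a lookup from Python, assuming the official Neo4j Python driver and the 3.x CREATE INDEX syntax used in this answer; the bolt URL, credentials, and the :Gene(name) label/property (borrowed from your metabolic-model example) are placeholders:

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    # Create the index once so MATCH on :Gene(name) can seek directly instead of scanning
    session.run("CREATE INDEX ON :Gene(name)")
    # PROFILE shows whether an index seek, a label scan, or an all-nodes scan was used
    result = session.run("PROFILE MATCH (g:Gene {name: $name}) RETURN g", name="ACO1")
    print(result.consume().profile)
driver.close()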

Python dictionary of sets in SQL

I have a dictionary in Python where the keys are integers and the values sets of integers. Considering the potential size (millions of key-value pairs, where a set can contain from 1 to several hundreds of integers), I would like to store it in a SQL (?) database, rather than serialize it with pickle to store it and load it back in whenever I need it.
From reading around I see two potential ways to do this, both with its downsides:
Serialize the sets and store them as BLOBs: I would get an SQL table with two columns, the first column being the keys of the dictionary as INTEGER PRIMARY KEY, the second column being the BLOBs, each containing a set of integers.
Downside: I'm no longer able to alter a set without loading the complete BLOB, adding a value to it, serializing it back, and inserting it back into the database as a BLOB.
Add a unique key for each element of each set: I would get two columns, one with the keys (which are now key_dictionary + index element of set/list), one with one integer value in each row. I'd now be able to add values to a "set" without having to load the whole set into python. I would have to put more work in keeping track of all the keys.
In addition, once the database is complete, I will always need sets as a whole, so idea 1 seems to be faster? If I query for all primary keys BETWEEN certain values, or LIKE certain values, to obtain my whole set in approach 2, will the SQL database (sqlite) still work like a hashtable? Or will it linearly search for all values that fit my BETWEEN or LIKE search?
Overall, what's the best way to tackle this problem? Obviously, if there's a completely different 3rd way that solves my problems naturally, feel free to suggest it! (haven't found any other solution by searching around)
I'm kind of new to Python and especially to databases, so let me know if my question isn't clear. :)
Your second approach is nearly what I would recommend. What I would do is have three columns:
Set ID
Key
Value
I would then create a composite primary key on the Set ID and Key which guarantees that the combination is unique:
CREATE TABLE something (
    set,
    key,
    value,
    PRIMARY KEY (set, key)
);
You can now add a value straight into a particular set (Or update a key in a set) and select all keys in a set.
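A minimal usage sketch with Python's built-in sqlite3 module, following the schema above; the table and column names come from the answer ("set" and "key" are quoted because they are SQL keywords), and the sample data is made up:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE something ('
             '"set" INTEGER, "key" INTEGER, value INTEGER, '
             'PRIMARY KEY ("set", "key"))')

# Store the dictionary entry 7 -> {10, 20, 30}
members = [10, 20, 30]
conn.executemany('INSERT INTO something ("set", "key", value) VALUES (?, ?, ?)',
                 [(7, i, v) for i, v in enumerate(members)])

# Read the whole set back
rows = conn.execute('SELECT value FROM something WHERE "set" = ?', (7,)).fetchall()
print(set(v for (v,) in rows))  # {10, 20, 30}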
This being said, your first strategy would be more optimal for read-heavy workloads as the size of the indexes would be smaller.
will the SQL database (sqlite) still work as a hashtable?
SQL databases tend not to use hashtables. Nor do they usually do a sequential lookup. What they usually do is create an index (which tends to be some kind of tree, e.g. a B-tree), which allows for range lookups (e.g. where you don't know exactly what keys you're looking for).
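As a quick way to check this yourself, sqlite can show its plan; in this made-up example the BETWEEN range query is answered with a B-tree search on the INTEGER PRIMARY KEY rather than a full scan:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv (k INTEGER PRIMARY KEY, v INTEGER)")
conn.executemany("INSERT INTO kv VALUES (?, ?)", [(i, i * i) for i in range(1000)])

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT v FROM kv WHERE k BETWEEN 100 AND 200").fetchall()
print(plan)  # expect something like: SEARCH ... USING INTEGER PRIMARY KEY (rowid>? AND rowid<?)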

Dictionary, set or frozenset?

I have a large collection of data, about 10 million entries, and part of my program requires very many membership checks...
if a in data:
    return True
return False
right now I have data as dictionary entries with all their values equal to '1'
I also have a program that uses an algorithm to figure out the same information, but for now it is slower than the dictionary method. However, I expect the size of data to continue growing...
For my current dictionary solution, would converting data to a frozenset, or a set (or something else?), be faster?
And, to find out when I need to switch to my program in the future, does anyone know how the speed of checking membership correlates with the size of a hashable type as it grows?
Is a dictionary with 1 billion entries still fast?
In principle
If you expect the data to keep growing you can't use a frozenset.
A set would be smaller than a dictionary, storage-wise, for testing whether an element exists in it. It would be similar in speed to a dictionary lookup, since the keys of a dict and the items of a set are both hashed for storage and always unique. If you don't need data associated with the keys, use a set.
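As a quick, made-up illustration of the "similar in speed" point, timing membership tests on a dict and on a set built from the same keys gives essentially the same numbers (the sizes and values here are arbitrary):

import timeit

data_dict = dict((i, 1) for i in range(1000000))
data_set = set(data_dict)

probe = 999999
print(timeit.timeit(lambda: probe in data_dict, number=1000000))
print(timeit.timeit(lambda: probe in data_set, number=1000000))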
Practically speaking...
When you are dealing with that many entries move the data to a database. You will eventually run out of memory trying to store and read all of that into memory. With a database, you can issue a specific query to check membership. Seriously. Put that data in a database.
For this amount of data RyPeck is right - a DB will do the job much better.
One more point:
Something seems odd to me in what you've written:
If you use a dictionary to store the membership objects, why is the value of each key-value pair in the dictionary '1'? Shouldn't the key-value pair of the dictionary be "id of a" -> "a", where 'a' is the object?
There are several bytes of overhead per entry in a hash table (whether it is a dictionary or a set doesn't make much of a difference), so for billions of entries you will run into swapping unless you have 32+ GB of memory for the application. I would start looking for a fast DB.
For a frozenset you also need to have all the data in memory in some acceptable form at creation time, which probably doubles the amount of memory needed.

Do python references use memory?

I am building a flexible, lightweight, in-memory database in Python, and discovered a performance problem with the way I was looking up values and using indexes. In an effort to improve this I've tried a few options, trying to balance speed with memory usage. My current implementation uses a dict of dicts to store data by record (object reference) and field (also an object reference). So for example, if I have three records with three fields, where some of the data is missing (i.e. NULL values):
{<Record1>: {<Field1>: 4, <Field2>: 'value', <Field3>: <Other Record>},
 <Record2>: {<Field1>: 4, <Field2>: 'value'},
 <Record3>: {<Field1>: 5}}
I considered a numpy array, but I would still need two dictionaries to map object instances to array indexes, so I can't see that it would perform any better.
Indexes are implemented using a pair of bisected lists, essentially acting as a map from value to record instance. For example, an index on Field1 above:
[[4, 4, 5], [<Record1>, <Record2>, <Record3>]]
I was previously using a simple dict of bins, but this didn't allow range lookups (e.g. all values > 5) (see Python hash table for fuzzy matching).
My question is this. I am concerned that I have several object references, and multiple copies of the same values in the indexes. Do all these duplicate references actually use more memory, or are references cheap in python? My alternative is to try to associate a numerical key to each object, which might improve things at least up to 256, but I don't know enough about how python handles references to know if this would really be any better.
Does anyone have any suggestions of a better way to manage this?
Reimplementing the critical parts in C is an option I want to keep as a last resort.
For anyone interested, my code is here.
Edit 1:
The question, simply put, is which of the following is more efficient in terms of memory usage, where a is an object instance and i is an integer:
[a] * 1000
Or
[i] * 1000, {a: i}
Edit 2:
Because of the large number of comments suggesting I use an existing system, here are my requirements. If anyone can suggest a system which fulfills all of these, that would be great, but so far I have not found anything which does. Otherwise, my original question still relates to the memory usage of references in Python:
Must be light-weight and in-memory. Definitely not a client/server model.
Need to be able to easily alter tables, change fields, change rules, etc, on the fly.
Need to easily apply very complex validation rules. SQL doesn't meet this requirement. Although it is sometimes possible to build up very complicated statements, it is far from easy.
Need to support joins and associations between tables. Many NoSQL databases don't support joins at all, or at most only simple joins.
Need to support a method of loading and storing data to any file format. I am currently implementing this by providing a framework which makes it easy to add new formats as needed.
It does not need persistence (beyond storing data as in the previous point), and does not need to handle massive amounts of data, i.e. not more than a couple of million records. Typically, I am dealing with a few thousand.
Each reference is in effect a pointer, and each pointer requires a small amount of memory.
You can use a memory profiler (such as the memory_profiler module) to view memory use on a line-by-line basis. This way you can see what happens when you make a reference.
Python does not specify a particular implementation for dynamic memory management, but from the semantics of the language one can assume that a reference uses memory similar to a C pointer.
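To make the comparison from Edit 1 concrete, here is a rough sketch of my own using sys.getsizeof; note that it only measures the containers, not the objects they point to, and both lists hold 1000 same-sized pointers, so the second layout pays extra for the dict:

import sys

class Record(object):
    pass

a = Record()
i = 42

refs_to_object = [a] * 1000   # option 1: 1000 references to the object
refs_to_int = [i] * 1000      # option 2: 1000 references to an int...
mapping = {a: i}              # ...plus a dict mapping the object to that int

print(sys.getsizeof(refs_to_object))
print(sys.getsizeof(refs_to_int) + sys.getsizeof(mapping))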
FWIW, I ran some tests on a 100x100 structure, testing a sparsely populated dictionary structure, a fully populated dictionary structure, a list, and a numpy array. The latter two had a dictionary mapping object references to indexes. I timed getting every item in the structure by index (returning a sentinel for missing data in the sparse dict), and also reported the total size. My results were somewhat surprising:
Structure       Time        Size
=============   ========    =====
full dict       0.0236s      6284
list            0.0426s     13028
sparse dict     0.1079s      1676
array           0.2262s     12608
So the fastest and second smallest was the full dict, presumably because there was no need to run a "key in dict" check on it.

merging dictionaries in python

Sorry for the very general title but I'll try to be as specific as possible.
I am working on a text mining application. I have a large number of key-value pairs of the form ((word, corpus) -> occurrence_count) (everything is an integer) which I am storing in multiple Python dictionaries (tuple -> int). These values are spread across multiple files on disk (I pickled them). To make any sense of the data, I need to aggregate these dictionaries. Basically, I need to figure out a way to find all the occurrences of a particular key in all the dictionaries, and add them up to get a total count.
If I load more than one dictionary at a time, I run out of memory, which is the reason I had to split them in the first place. When I tried, I ran into performance issues. I am currently trying to store the values in a DB (MySQL), processing multiple dictionaries at a time, since MySQL provides row-level locking, which is both good (since it means I can parallelize this operation) and bad (since it slows down the insert queries).
What are my options here? Is it a good idea to write a partially disk based dictionary so I can process the dicts one at a time? With an LRU replacement strategy? Is there something that I am completely oblivious to?
Thanks!
A disk-based dictionary-like object exists -- see the shelve module. Keys into a shelf must be strings, but you could simply use str on your tuples to obtain equivalent string keys; plus, I read your Q as meaning that you want only word as the key, so that's even easier (either str -- or, for vocabularies < 4GB, a struct.pack -- will be fine).
A good relational engine (especially PostgreSQL) would serve you well, but processing one dictionary at a time to aggregate each word's occurrences over all corpora into a shelf object should also be OK (not quite as fast, but simpler to code, since a shelf is so similar to a dict, except for the type constraint on keys, and a caveat for mutable values -- but as your values are ints, that need not concern you).
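A rough sketch of that shelve-based aggregation, assuming (as in the question) that each pickle file holds a dict mapping (word, corpus) tuples to int counts; filenames and the shelf file name are placeholders:

import pickle
import shelve

totals = shelve.open('word_totals.db')   # hypothetical shelf file
for fn in filenames:                     # filenames: your list of pickled dict files
    with open(fn, 'rb') as f:
        data_dict = pickle.load(f)
    for (word, corpus), count in data_dict.items():
        key = str(word)                  # shelf keys must be strings; aggregate per word
        totals[key] = totals.get(key, 0) + count
totals.close()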
Something like this, if I understand your question correctly:
from collections import defaultdict
import pickle

result = defaultdict(int)
for fn in filenames:
    data_dict = pickle.load(open(fn, 'rb'))
    for k, count in data_dict.items():
        word, corpus = k  # unpack the (word, corpus) key
        result[k] += count
If I understood your question correctly and you have integer ids for the words and corpora, then you can gain some performance by switching from a dict to a list, or even better, a numpy array. This may be annoying!
Basically, you need to replace the tuple with a single integer, which we can call the newid. You want all the newids to correspond to a word,corpus pair, so I would count the words in each corpus, and then have, for each corpus, a starting newid. The newid of (word,corpus) will then be word + start_newid[corpus].
If I misunderstood you and you don't have such ids, then I think this advice might still be useful, but you will have to manipulate your data to get it into the tuple of ints format.
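A small sketch of that newid scheme, under the assumption that word ids are dense integers starting at 0 within each corpus; the corpus sizes here are invented:

import numpy as np

counts_per_corpus = {0: 50000, 1: 120000}   # hypothetical: corpus id -> vocabulary size

start_newid = {}
offset = 0
for corpus in sorted(counts_per_corpus):
    start_newid[corpus] = offset
    offset += counts_per_corpus[corpus]

totals = np.zeros(offset, dtype=np.int64)   # one counter per (word, corpus) pair

def newid(word, corpus):
    # flatten the (word, corpus) tuple into a single array index
    return start_newid[corpus] + word

# while scanning each loaded dict: totals[newid(word, corpus)] += count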
Another thing you could try is rechunking the data.
Let's say that you can only hold 1.1 of these monsters in memory. Then, you can load one, and create a smaller dict or array that only corresponds to the first 10% of (word,corpus) pairs. You can scan through the loaded dict, and deal with any of the ones that are in the first 10%. When you are done, you can write the result back to disk, and do another pass for the second 10%. This will require 10 passes, but that might be OK for you.
If you chose your previous chunking based on what would fit in memory, then you will have to arbitrarily break your old dicts in half so that you can hold one in memory while also holding the result dict/array.
