I am building a web app with MongoDB as the backend. Some of the documents need to store a collection of items in some sort of list, and the system will then need to frequently check whether a specified item is present in that list. Using Python's 'in' operator on a list takes O(N) time, N being the length of the list. Since these lists can get quite large, I want something faster than that. Python's 'set' type does this operation in constant time (and enforces uniqueness, which is good in my case), but is not a valid data type to store in MongoDB.
So what's the best way to do this? Is there some way to just use a regular list and exploit Mongo's indexing features? Again, I want to know, for a given document in a collection, whether a list inside that document contains a particular element.
You can represent a set using a dictionary. Your elements become the keys, and all the values can be set to a constant such as 1. The in operator checks for the existence of a key.
EDIT: MongoDB stores a dict as a BSON document, whose keys must be strings (with some additional restrictions), so the above advice is of limited use.
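For illustration, here is a minimal sketch of the dict-as-set idea, assuming pymongo and a hypothetical "docs" collection; the string-key restriction noted above applies once the dict is stored as a BSON sub-document.

from pymongo import MongoClient

# Sketch only: store the set as a sub-document whose keys are the elements
# (as strings) and whose values are a constant 1.
client = MongoClient()
docs = client.mydb.docs

docs.insert_one({"_id": 1, "tags": {"python": 1, "mongodb": 1}})

# In Python, the in operator is a constant-time key lookup on the dict.
doc = docs.find_one({"_id": 1})
print("python" in doc["tags"])    # True

# The same check can be pushed to the server with a dotted-field $exists query.
print(docs.find_one({"_id": 1, "tags.python": {"$exists": True}}) is not None)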
Related
I am trying to design a system in Python where my customers can create an order, and the orders will be stored in an array or similar structure that can keep expanding as more orders are placed. What is the best way to do this?
I can think of two ways to do this.
Serialization.
Create two tables, one called order and the other called order_contents, which you can join on the order id. Store order-specific data in the order table and item-specific data in order_contents. All of an order's contents can then be retrieved with a single SQL query, or easily in Python with an ORM. A rough sketch follows below.
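The sketch below uses Python's sqlite3 for brevity; the table and column names are illustrative, not prescribed.

import sqlite3

# Hypothetical schema: one row per order, plus one row per line item.
conn = sqlite3.connect("orders.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS orders (
    order_id   INTEGER PRIMARY KEY,
    customer   TEXT,
    created_at TEXT
);
CREATE TABLE IF NOT EXISTS order_contents (
    order_id INTEGER REFERENCES orders(order_id),
    item     TEXT,
    quantity INTEGER,
    price    REAL
);
""")
conn.commit()

# Fetch an order together with all of its contents in a single query.
rows = conn.execute("""
    SELECT o.order_id, o.customer, c.item, c.quantity, c.price
    FROM orders o
    JOIN order_contents c ON c.order_id = o.order_id
    WHERE o.order_id = ?
""", (1,)).fetchall()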
How big would you expect an order to get, and how many orders could there be? Also, what is stored in an order?
If you use numpy arrays, you have the problem that resizing an array is a very expensive operation, so doing it many times on large arrays would be problematic. Numpy arrays are also meant for data that is all of the same type, so you would not use them for records that combine strings (name, item), integers (item reference), and floats (cost).
A simple list is likely the easiest and most inclusive choice. You can put whatever you like in a list and grow it cheaply, since a Python list is really just an array of pointers.
A dictionary could be useful if you are expecting to have to search the list often, or have a clear key-item relationship.
It really comes down to your use case. A list is often the choice, but a dictionary could be nice, and numpy arrays are nice if you are doing math with the stored data.
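As a small illustration (the field names are made up), a list of order dicts grows cheaply, and an auxiliary dict gives O(1) lookup by order id if searching becomes frequent:

# Sketch only: hypothetical order fields.
orders = []
orders_by_id = {}

def add_order(order_id, name, item, cost):
    order = {"id": order_id, "name": name, "item": item, "cost": cost}
    orders.append(order)           # amortized O(1) append
    orders_by_id[order_id] = order # O(1) lookup later

add_order(1, "drew", "widget", 9.99)
print(orders_by_id[1]["item"])     # prints widget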
I have a dictionary in Python where the keys are integers and the values are sets of integers. Considering the potential size (millions of key-value pairs, where a set can contain from one to several hundred integers), I would like to store it in an SQL (?) database, rather than serialize it with pickle to store it and load it back in whenever I need it.
From reading around I see two potential ways to do this, each with its downsides:
Serialize the sets and store them as BLOBs: I would get an SQL table with two columns, the first column holding the dictionary keys as INTEGER PRIMARY KEY, the second column holding BLOBs, each containing a set of integers.
Downside: the sets can no longer be altered without loading the complete BLOB, adding the value in Python, serializing the set again, and inserting it back into the database as a BLOB.
Add a unique key for each element of each set: I would get two columns, one with the keys (which are now dictionary key + index of the element in the set/list), one with a single integer value in each row. I'd now be able to add values to a "set" without having to load the whole set into Python, but I would have to put more work into keeping track of all the keys.
In addition, once the database is complete, I will always need each set as a whole, so idea 1 seems faster? If I query for all primary keys BETWEEN certain values, or LIKE certain values, to obtain my whole set under scheme 2, will the SQL database (sqlite) still work like a hashtable? Or will it linearly search for all values that match my BETWEEN or LIKE condition?
Overall, what's the best way to tackle this problem? Obviously, if there's a completely different third way that solves my problem naturally, feel free to suggest it! (I haven't found any other solution by searching around.)
I'm kind of new to Python and especially to databases, so let me know if my question isn't clear. :)
Your second idea is nearly what I would recommend. What I would do is have three columns:
Set ID
Key
Value
I would then create a composite primary key on the Set ID and Key which guarantees that the combination is unique:
CREATE TABLE something (
    set_id  INTEGER,
    key     INTEGER,
    value   INTEGER,
    PRIMARY KEY (set_id, key)
);
You can now add a value straight into a particular set (or update a key in a set) and select all the keys in a set.
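A minimal sketch of those operations with Python's sqlite3 (using the table above; the specific statements are illustrative):

import sqlite3

conn = sqlite3.connect("sets.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS something (
        set_id INTEGER,
        key    INTEGER,
        value  INTEGER,
        PRIMARY KEY (set_id, key)
    )
""")

# Add (or update) a single member of set 42 without loading the whole set.
conn.execute("INSERT OR REPLACE INTO something (set_id, key, value) VALUES (?, ?, ?)",
             (42, 0, 1234))
conn.commit()

# Pull the whole set back out as a Python set.
members = {row[0] for row in
           conn.execute("SELECT value FROM something WHERE set_id = ?", (42,))}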
This being said, your first strategy would be more optimal for read-heavy workloads as the size of the indexes would be smaller.
will the SQL database (sqlite) still work as a hashtable?
SQL databases tend not to use hashtables, nor do they usually do a sequential scan. What they usually do is create an index (which tends to be some kind of tree, e.g. a B-tree) that allows for range lookups (e.g. where you don't know exactly which keys you're looking for).
I have a set of 100+ million strings, each up to 63 characters long. I have lots of disk space and very little memory (512 MB). I need to query for existence alone, and store no additional metadata.
My de facto solution is a BDB btree. Are there any preferable alternatives? I'm aware of leveldb and Kyoto Cabinet, but not familiar enough to identify advantages.
If false positives are acceptable, then one possible solution would be to use a bloom filter. Bloom filters are similar to hash tables, but instead of using one hash value to index a table of buckets, they use multiple hashes to index a bit array. The bits corresponding to those indices are set. Then, to test whether a string is in the filter, the string is hashed again, and if the corresponding indices are set, the string is "in" the filter.
It doesn't store any information about the strings, so it uses very little memory -- but if there's a collision between two strings, no collision resolution is possible. This means there may be false positives (because a string not in the filter may hash to the same indices as a string that is in the filter). However, there can be no false negatives; any string that really is in the set will be found in the bloom filter.
There are a few Python implementations. It's also not hard to roll your own; I recall coding up a quick-and-dirty bloom filter myself once using bitarrays that performed pretty well.
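As an illustration only (not the implementation mentioned above), here is a quick-and-dirty sketch using a plain bytearray as the bit array and indices sliced from a SHA-256 digest; the sizes are arbitrary and would need tuning.

import hashlib

class BloomFilter:
    def __init__(self, num_bits=8 * 1024 * 1024, num_hashes=7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _indices(self, s):
        digest = hashlib.sha256(s.encode("utf-8")).digest()
        for i in range(self.num_hashes):
            # Slice the digest into independent 4-byte chunks, one per hash.
            chunk = digest[4 * i:4 * i + 4]
            yield int.from_bytes(chunk, "big") % self.num_bits

    def add(self, s):
        for idx in self._indices(s):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def __contains__(self, s):
        return all(self.bits[idx // 8] & (1 << (idx % 8)) for idx in self._indices(s))

bf = BloomFilter()
bf.add("drew sears")
print("drew sears" in bf)     # True
print("someone else" in bf)   # almost certainly False (small false-positive chance)

For 100+ million strings you would want roughly 10 bits per element (on the order of 120 MB of bit array) to keep the false-positive rate near 1%, which still fits comfortably in 512 MB.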
You said you have lots of disk, huh? One option would be to store the strings as filenames in nested subdirectories. You could either use the strings directly:
Store "drew sears" in d/r/e/w/ sears
or by taking a hash of the string and following a similar process:
MD5('drew sears') = 'f010fe6e20d12ed895c10b93b2f81c6e'
Create an empty file named f0/10/fe/6e/20d12ed895c10b93b2f81c6e.
Think of it as an OS-optimized, hash-table based indexed NoSQL database.
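A rough sketch of the hashed-filename variant in Python (the root directory and the 2-character split are assumptions matching the example above):

import hashlib
from pathlib import Path

ROOT = Path("string_index")   # hypothetical root directory

def path_for(s):
    h = hashlib.md5(s.encode("utf-8")).hexdigest()
    # First four 2-character levels, remainder as the filename.
    return ROOT / h[0:2] / h[2:4] / h[4:6] / h[6:8] / h[8:]

def add(s):
    p = path_for(s)
    p.parent.mkdir(parents=True, exist_ok=True)
    p.touch()                 # empty file: existence is the only metadata

def contains(s):
    return path_for(s).exists()

add("drew sears")
print(contains("drew sears"))   # True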
Side benefits:
You could change your mind at a later time and store data in the file.
You can replicate your database to another system using rsync.
Sorry for the very general title but I'll try to be as specific as possible.
I am working on a text mining application. I have a large number of key-value pairs of the form ((word, corpus) -> occurrence_count) (everything is an integer), which I am storing in multiple Python dictionaries (tuple -> int). These values are spread across multiple files on disk (I pickled them). To make any sense of the data, I need to aggregate these dictionaries. Basically, I need to figure out a way to find all the occurrences of a particular key in all the dictionaries, and add them up to get a total count.
If I load more than one dictionary at a time, I run out of memory, which is the reason I had to split them in the first place. When I tried, I ran into performance issues. I am currently trying to store the values in a DB (MySQL), processing multiple dictionaries at a time, since MySQL provides row-level locking, which is both good (since it means I can parallelize this operation) and bad (since it slows down the insert queries).
What are my options here? Is it a good idea to write a partially disk based dictionary so I can process the dicts one at a time? With an LRU replacement strategy? Is there something that I am completely oblivious to?
Thanks!
A disk-based dictionary-like object exists -- see the shelve module. Keys into a shelf must be strings, but you could simply use str on your tuples to obtain equivalent string keys; plus, I read your question as meaning that you want only word as the key, so that's even easier (either str -- or, for vocabularies < 4GB, a struct.pack -- will be fine).
A good relational engine (especially PostgreSQL) would serve you well, but processing one dictionary at a time to aggregate each word's occurrences over all corpora into a shelf object should also be OK (not quite as fast, but simpler to code, since a shelf is so similar to a dict except for the type constraint on keys, and a caveat for mutable values, but as your values are ints that need not concern you).
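A rough sketch of that approach, assuming each pickled file holds a dict mapping (word, corpus) tuples to counts and that only the word is wanted as the final key:

import glob
import pickle
import shelve

filenames = glob.glob("*.pkl")   # adjust to wherever your pickled dicts live

with shelve.open("totals.shelf") as totals:
    for fn in filenames:
        with open(fn, "rb") as f:
            data_dict = pickle.load(f)
        for (word, corpus), count in data_dict.items():
            key = str(word)                       # shelf keys must be strings
            totals[key] = totals.get(key, 0) + count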
Something like this, if I understand your question correctly:
from collections import defaultdict
import pickle

result = defaultdict(int)
for fn in filenames:
    with open(fn, "rb") as f:       # pickled files must be opened in binary mode
        data_dict = pickle.load(f)
    for k, count in data_dict.items():
        word, corpus = k            # k is a (word, corpus) tuple
        result[k] += count
If I understood your question correctly and you have integer ids for the words and corpora, then you can gain some performance by switching from a dict to a list, or even better, a numpy array. This may be annoying!
Basically, you need to replace the tuple with a single integer, which we can call the newid. You want all the newids to correspond to a (word, corpus) pair, so I would count the words in each corpus, and then have, for each corpus, a starting newid. The newid of (word, corpus) will then be word + start_newid[corpus].
If I misunderstood you and you don't have such ids, then I think this advice might still be useful, but you will have to manipulate your data to get it into the tuple of ints format.
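An illustrative sketch of that mapping (the corpus sizes here are made up):

# Hypothetical: number of distinct word ids in each corpus.
words_per_corpus = {0: 1000, 1: 2500, 2: 800}

start_newid = {}
offset = 0
for corpus in sorted(words_per_corpus):
    start_newid[corpus] = offset
    offset += words_per_corpus[corpus]

def newid(word, corpus):
    return start_newid[corpus] + word

counts = [0] * offset          # or a numpy array: numpy.zeros(offset, dtype=numpy.int64)
counts[newid(17, 1)] += 3      # (word=17, corpus=1) occurred 3 more times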
Another thing you could try is rechunking the data.
Let's say that you can only hold 1.1 of these monsters in memory. Then, you can load one, and create a smaller dict or array that only corresponds to the first 10% of (word,corpus) pairs. You can scan through the loaded dict, and deal with any of the ones that are in the first 10%. When you are done, you can write the result back to disk, and do another pass for the second 10%. This will require 10 passes, but that might be OK for you.
If you chose your previous chunking based on what would fit in memory, then you will have to arbitrarily break your old dicts in half so that you can hold one in memory while also holding the result dict/array.
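A rough sketch of that rechunking pass (key_to_chunk is a made-up helper; filenames is your list of pickled dict files, as above):

import pickle

NUM_PASSES = 10

def key_to_chunk(key):
    # Assign each (word, corpus) key to one of NUM_PASSES buckets.
    return hash(key) % NUM_PASSES

for chunk in range(NUM_PASSES):
    partial = {}                              # only ~10% of the keys at a time
    for fn in filenames:
        with open(fn, "rb") as f:
            data_dict = pickle.load(f)
        for key, count in data_dict.items():
            if key_to_chunk(key) == chunk:
                partial[key] = partial.get(key, 0) + count
    with open("result_chunk_%d.pkl" % chunk, "wb") as out:
        pickle.dump(partial, out)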
I want to store a list of numbers along with some other fields into MySQL. The number of elements in the list is dynamic (at times it could hold about 60 elements).
Currently I'm storing the list in a column of VARCHAR type, and the following operations are done.
e.g. aList = [1234122433,1352435632,2346433334,1234122464]
At storing time, aList is converted to a string as below:
aListStr = str(aList)
and at reading time the string is converted back to a list as below:
aList = eval(aListStr)
There are about 10 million rows, and since I'm storing the lists as strings, they occupy a lot of space. What is the most efficient way to do this?
Also, what would be an efficient way to store a list of strings instead of numbers?
Since you wish to store integers, an effective way would be to store them in an INT/DECIMAL column.
Create an additional table that will hold these numbers and add an ID column to relate the records to other table(s).
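A rough sketch of that layout (illustrative names, shown with Python's sqlite3 for brevity; in MySQL the number column would be INT or BIGINT):

import sqlite3

conn = sqlite3.connect("numbers.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS records (
    record_id INTEGER PRIMARY KEY
    -- ... the other fields of the row ...
);
CREATE TABLE IF NOT EXISTS record_numbers (
    record_id INTEGER REFERENCES records(record_id),
    number    INTEGER
);
""")

aList = [1234122433, 1352435632, 2346433334, 1234122464]
conn.execute("INSERT INTO records (record_id) VALUES (?)", (1,))
conn.executemany("INSERT INTO record_numbers (record_id, number) VALUES (?, ?)",
                 [(1, n) for n in aList])
conn.commit()

# Membership test for one number without pulling the whole list back:
hit = conn.execute("SELECT 1 FROM record_numbers WHERE record_id = ? AND number = ?",
                   (1, 1352435632)).fetchone() is not None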
Also, what would be an efficient way to store a list of strings instead of numbers?
Besides what I said, you can convert the strings to hex, which is easy and takes less space.
Note that a large VARCHAR may hurt performance.
The difference between VARCHAR(2) and VARCHAR(50) matters when operations like sorting are performed, since MySQL allocates fixed-size memory slices for them according to the VARCHAR maximum size.
When those slices are too large to store in memory, MySQL will store them on disk.
MySQL also has a SET type; it works like ENUM but can hold multiple items.
Of course, you'd have to have a limited list of possible values; currently MySQL only supports up to 64 distinct items.
I'd be less worried about storage space and more worried about record retrieval, i.e., indexability/searching.
For example, I imagine performing a LIKE or REGEXP in a WHERE clause to find a single item in the list will be quite a bit more expensive than if you normalized each list item into a row in a separate table.
However, if you never need to perform such queries against these columns, then it just won't matter.
Since you are using a relational database, you should know that storing non-atomic values in individual fields breaks even first normal form. More likely than not, you should follow Don's advice and keep those values in a related table. I can't say that for certain because I don't know your problem domain. It may well be that choosing an RDBMS for this data was a bad choice altogether.