I have a (key, value) map where each key is a string and each value is a fairly large list of heterogeneous lists (up to about 250 items). Each inner list is a mix of strings and numbers that I might want to iterate over. If I wanted to store thousands of such (key, value) pairs persistently for efficient retrieval, what are the best options? If I use sqlite, then I would need to create a table for each key and map the lists to individual records in the database. Are there better, more efficient options if the goal is fast retrieval of the list of lists for a particular key?
Here is a short example. Say animals is a map of keys to list of lists. Sample data looks like this:
animals = {
    "Lion": [["Siberian", 203, "Tanzania", 123.56], ["Russian", 321, "Timbktu", 23423.2]],
    "Tiger": [["White", 121, "Australia", 1211.1], ["Indian", 111, "India", 1241.5]]
}
So I want to be able to persist this data structure, quickly index it by the name of an animal (always unique), and get the list of lists for the particular animal I care about. If the lists within each animal's info are of fixed length with fixed fields, can I exploit that somehow to improve efficiency?
As Blender states in the comment, pickle is a reasonable choice. On Python 2, use the C-based cPickle rather than the pure-Python pickle module; on Python 3, the standard pickle module already uses the C implementation when it is available. Alternatively, consider dill.
I would suggest one of the fast JSON libraries. There are several speed comparisons online that suggest that JSON can be as fast as, or even faster than, pickle. Check this one for example:
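A minimal sketch of the pickle approach, assuming the whole map fits in memory and is saved and loaded as a single file. The file name animals.pkl and the temporary directory are made up for illustration.

```python
import os
import pickle
import tempfile

animals = {
    "Lion": [["Siberian", 203, "Tanzania", 123.56], ["Russian", 321, "Timbktu", 23423.2]],
    "Tiger": [["White", 121, "Australia", 1211.1], ["Indian", 111, "India", 1241.5]],
}

# Hypothetical location; any writable path works.
path = os.path.join(tempfile.mkdtemp(), "animals.pkl")

# Pickle files are binary, so open them with "wb" / "rb".
with open(path, "wb") as f:
    pickle.dump(animals, f, protocol=pickle.HIGHEST_PROTOCOL)

with open(path, "rb") as f:
    restored = pickle.load(f)

# Fetch the list of lists for one key.
tiger_rows = restored["Tiger"]
```

Note that this loads every key at once, which is probably fine for a few thousand keys but does not give per-key access on disk.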
http://lvsl.github.io/2011/12/28/python-serialization-benchmark.html
and
https://blog.hartleybrody.com/python-serialize/
There are several JSON serialization alternatives, and again, there are some comparisons online, e.g.
https://medium.com/#jyotiska/json-vs-simplejson-vs-ujson-a115a63a9e26
I would suggest looking into ujson, which seems to be really fast and has one big advantage over pickle: the data is saved in a human-readable format, so it is very easy to inspect. On the other hand, pickle is a bit easier to use with custom types, although you can still define custom encoders for them in JSON. Overall, choose JSON if you care more about human readability, and pickle if what really matters is writing a few lines less code for custom types.
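A small sketch of the JSON route, using the standard-library json module so that it runs without extra installs. Swapping in ujson for speed is left as an assumption to verify against its own documentation. The encode_extra helper and the date value are hypothetical, added only to show what a custom encoder looks like.

```python
import json
from datetime import date

animals = {
    "Lion": [["Siberian", 203, "Tanzania", 123.56]],
    "Tiger": [["White", 121, "Australia", 1211.1]],
}

# Plain strings and numbers round-trip without any extra code,
# and the text is readable in any editor.
text = json.dumps(animals, indent=2)
restored = json.loads(text)

def encode_extra(obj):
    """Fallback for types json cannot encode; here only dates (illustrative)."""
    if isinstance(obj, date):
        return obj.isoformat()
    raise TypeError(f"Cannot serialize {type(obj).__name__}")

# A custom type needs a few extra lines, which pickle would not require.
with_date = json.dumps({"Lion": [["Siberian", date(2020, 1, 2)]]}, default=encode_extra)
```

The date comes back as a plain string on load, so decoding custom types needs matching code on the reading side as well.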
Depending on your needs, you may want to consider Redis, which is an excellent key-value database. This tutorial provides a relatively quick introduction.
I am trying to design a system in python where my customers can create an order and it will be stored in an array or similar type of structure that will be able to constantly expand to store more orders as they are placed. What is the best way to do this?
I can think of two ways to do this.
Serialization.
Create two tables, one called Order and the other called Order_contents, and join them by order id. Store order-specific data in the Order table and content-specific data in the Order_contents table. All contents can then be retrieved with a single SQL query, or easily from Python with an ORM.
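The two-table layout could look roughly like this with the standard-library sqlite3 module. The column names and sample rows are invented for illustration, and the first table is called orders here because ORDER is a reserved word in SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, just for the example
conn.executescript("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer TEXT NOT NULL
    );
    CREATE TABLE order_contents (
        order_id INTEGER NOT NULL REFERENCES orders(id),
        item TEXT NOT NULL,
        quantity INTEGER NOT NULL,
        cost REAL NOT NULL
    );
""")

# One row per order, any number of content rows pointing back at it.
cur = conn.execute("INSERT INTO orders (customer) VALUES (?)", ("alice",))
order_id = cur.lastrowid
conn.executemany(
    "INSERT INTO order_contents (order_id, item, quantity, cost) VALUES (?, ?, ?, ?)",
    [(order_id, "widget", 2, 9.99), (order_id, "gadget", 1, 24.50)],
)

# A single query joins the order with all of its contents.
rows = conn.execute(
    """
    SELECT o.customer, c.item, c.quantity, c.cost
    FROM orders AS o
    JOIN order_contents AS c ON c.order_id = o.id
    WHERE o.id = ?
    ORDER BY c.item
    """,
    (order_id,),
).fetchall()
```

New orders are just new rows, so the structure grows without any resizing logic in the application.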
How big would you expect an order to get, and how many orders could there be? Also, what is stored in an order?
If you use numpy arrays you have the problem that increasing the size of an array is a very expensive process, so doing it many times on large arrays would be problematic. Numpy arrays also are for data that is all the same type, so you would not use this for things that are combinations of strings (name, item), integers (item reference), and floats (cost).
A simple list is likely the easiest and most inclusive choice. You can put whatever you like in a list and grow it cheaply, since a list only stores references to its items.
A dictionary could be useful if you are expecting to have to search the list often, or have a clear key-item relationship.
It really comes down to your use case. A list is often the choice, but a dictionary could be nice, and numpy arrays are nice if you are doing math with the stored data.
Since I recently started a new project, I'm stuck in the "think before you code" phase. I've always done basic coding, but I really think I now need to carefully plan how I should organize the results that are produced by my script.
It's essentially quite simple: I have a bunch of satellite data I'm extracting from Google Earth Engine, including different sensors, different acquisition modes, etc. What I would like to do is to loop through a list of "sensor-acquisition_mode" couples, request the data, do some more processing, and finally save it to a variable or file.
Suppose I have the following example:
sensors = ['landsat','sentinel1']
sentinel_modes = ['ASCENDING','DESCENDING']
sentinel_polarization = ['VV','VH']
In the end, I would like to have some sort of nested data structure that at the highest level has the elements 'landsat' and 'sentinel1'; under 'landsat' I would have a time and values matrix; under 'sentinel1' I would have the different modes and then as well the data matrices.
I've been thinking about lists, dictionaries or classes with attributes, but I really can't make up my mind, partly because I don't have that much experience.
At this stage, a little help in the right direction would be much appreciated!
Lists: Don't use lists for nested and complex data structures. You're just shooting yourself in the foot: the code you write will be specialized to the exact format you are using, and any changes or additions will be brutal to implement.
Dictionaries: These aren't bad. They nest nicely, and you can use a dictionary whose values are dictionaries to hold named info about the keys. This is probably the easiest choice.
Classes: Classes are really useful for this if you need a lot of behavior to go with the data: you want its string representation to look a certain way, you want to be able to use primitive operators for some functionality, or you just want to make the code slightly more readable or reusable.
From there, it's all your choice. If you want to go through the extra code (it's good for you) of writing them as classes, do it! Otherwise, dictionaries will get you where you need to go. The one thing a dictionary can't do is hold two entries with the same name at the same level (dict keys must be unique).
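As a rough sketch of the dictionary option for the sensor example from the question: fetch_series is a hypothetical placeholder for the Earth Engine request and processing, and the "time"/"values" layout is only one possible way to hold the matrices.

```python
sensors = ['landsat', 'sentinel1']
sentinel_modes = ['ASCENDING', 'DESCENDING']
sentinel_polarization = ['VV', 'VH']

def fetch_series(sensor, mode=None, polarization=None):
    """Hypothetical stand-in for the Earth Engine request and processing."""
    return {"time": [], "values": []}

results = {}
for sensor in sensors:
    if sensor == 'sentinel1':
        # sentinel1 gets two extra levels: mode, then polarization.
        results[sensor] = {
            mode: {pol: fetch_series(sensor, mode, pol) for pol in sentinel_polarization}
            for mode in sentinel_modes
        }
    else:
        # landsat holds its time/values data directly.
        results[sensor] = fetch_series(sensor)

# Access follows the nesting: sensor -> mode -> polarization.
vv_ascending = results['sentinel1']['ASCENDING']['VV']
```

Because the leaves are plain dicts and lists, a structure like this can also be dumped to JSON or pickle without further work.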
I am building a flexible, lightweight, in-memory database in Python, and discovered a performance problem with the way I was looking up values and using indexes. In an effort to improve this I've tried a few options, trying to balance speed with memory usage. My current implementation uses a dict of dicts to store data by record (object reference) and field (also an object reference). So for example, if I have three records with three fields, where some of the data is missing (i.e. NULL values):
{<Record1>: {<Field1>: 4, <Field2>: 'value', <Field3>: <Other Record>},
 <Record2>: {<Field1>: 4, <Field2>: 'value'},
 <Record3>: {<Field1>: 5}}
I considered a numpy array, but I would still need two dictionaries to map object instances to array indexes, so I can't see that it would perform any better.
Indexes are implemented using a pair of bisected lists, essentially acting as a map from value to record instance. For example, an index on <Field1> above:
[[4, 4, 5], [<Record1>, <Record2>, <Record3>]]
I was previously using a simple dict of bins, but this didn't allow range lookups (e.g. all values > 5) (see Python hash table for fuzzy matching).
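The pair of bisected lists described above might be sketched as follows with the standard-library bisect module. The SortedIndex class and its method names are hypothetical, and plain strings stand in for the record objects.

```python
import bisect

class SortedIndex:
    """Two parallel lists kept sorted by value: values[i] belongs to records[i]."""

    def __init__(self):
        self.values = []
        self.records = []

    def add(self, value, record):
        # Insert at the same position in both lists to keep them aligned.
        pos = bisect.bisect_right(self.values, value)
        self.values.insert(pos, value)
        self.records.insert(pos, record)

    def equal_to(self, value):
        lo = bisect.bisect_left(self.values, value)
        hi = bisect.bisect_right(self.values, value)
        return self.records[lo:hi]

    def greater_than(self, value):
        # Range lookup: everything strictly to the right of `value`.
        return self.records[bisect.bisect_right(self.values, value):]

index = SortedIndex()
for value, record in [(5, "Record3"), (4, "Record1"), (4, "Record2")]:
    index.add(value, record)
```

Lookups are O(log n), while each insert is O(n) because of the list shifts, which is usually acceptable for a few thousand records.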
My question is this. I am concerned that I have several object references, and multiple copies of the same values in the indexes. Do all these duplicate references actually use more memory, or are references cheap in python? My alternative is to try to associate a numerical key to each object, which might improve things at least up to 256, but I don't know enough about how python handles references to know if this would really be any better.
Does anyone have any suggestions of a better way to manage this?
Reimplementing the critical parts in C is an option I want to keep as a last resort.
For anyone interested, my code is here.
Edit 1:
The question, simply put, is which of the following is more efficient in terms of memory usage, where a is an object instance and i is an integer:
[a] * 1000
Or
[i] * 1000, {a: i}
Edit 2:
Because of the large number of comments suggesting I use an existing system, here are my requirements. If anyone can suggest a system which fulfills all of these, that would be great, but so far I have not found anything which does. Otherwise, my original question about the memory usage of references in Python still stands.
Must be light-weight and in-memory. Definitely not a client/server model.
Need to be able to easily alter tables, change fields, change rules, etc, on the fly.
Need to easily apply very complex validation rules. SQL doesn't meet this requirement. Although it is sometimes possible to build up very complicated statements, it is far from easy.
Need to support joins and associations between tables. Many NoSQL databases don't support joins at all, or at most only simple joins.
Need to support a method of loading and storing data to any file format. I am currently implementing this by providing a framework which makes it easy to add new formats as needed.
It does not need persistence (beyond storing data as in the previous point), and does not need to handle massive amounts of data, i.e. not more than a couple of million records. Typically, I am dealing with a few thousand.
Each reference is in effect a pointer, and each pointer requires a small amount of memory (typically 8 bytes on a 64-bit CPython build).
You can use a memory profiler (for example the memory_profiler package) to view memory use on a line-by-line basis. That way you can see what happens when you make a reference.
Python does not specify a particular implementation for dynamic memory management, but from the semantics of the language one can assume that a reference uses memory similar to a C pointer.
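A quick way to check this for the two variants from Edit 1, assuming CPython (sys.getsizeof is implementation-specific). The Record class is a hypothetical stand-in for whatever object a refers to.

```python
import sys

class Record:
    """Stand-in for an arbitrary object instance."""

a = Record()
i = 7

refs_to_object = [a] * 1000
refs_to_int = [i] * 1000

# getsizeof reports the list itself (header plus one pointer per slot),
# not the objects the slots point to.
size_obj_list = sys.getsizeof(refs_to_object)
size_int_list = sys.getsizeof(refs_to_int)

# The integer variant additionally needs the {a: i} lookup dict.
size_lookup = sys.getsizeof({a: i})

print(size_obj_list, size_int_list, size_lookup)
```

Under that assumption, both lists come out the same size, because a slot holding a reference to a costs the same as a slot holding a reference to i. The integer-key variant then only adds the cost of the extra dictionary.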
FWIW, I ran some tests on a 100x100 structure, testing a sparsely populated dictionary structure, a fully populated dictionary structure, a list, and a numpy array. The latter two had a dictionary mapping object references to indexes. I timed getting every item in the structure by index (returning a sentinel for missing data in the sparse dict), and also reported the total size. My results were somewhat surprising:
Structure      Time      Size
=============  ========  =====
full dict      0.0236s    6284
list           0.0426s   13028
sparse dict    0.1079s    1676
array          0.2262s   12608
So the fastest and second smallest was a full dict, presumably because there was no need to run a key in dict check on it.
For documents with lists that need pagination, is it better to embed or to use references? I'm reading about the custom type "SONManipulator", and it appears to transform everything on retrieval, even the sub-documents.
I want to keep the list in the document sorted; should this impact anything?
I don't fully understand your question, but it is generally better to embed documents for performance reasons. That is one of the major advantages of MongoDB's approach: data locality. The pymongo lib uses SON, an ordered dict implementation, which maintains the ordering of your document keys.
If your document contains a list/array of elements and you are concerned about the ordering of the elements, fear not, because the array order is maintained as well.
Sorry for the very general title but I'll try to be as specific as possible.
I am working on a text mining application. I have a large number of key-value pairs of the form ((word, corpus) -> occurence_count) (everything is an integer), which I am storing in multiple Python dictionaries (tuple->int). These values are spread across multiple files on disk (I pickled them). To make any sense of the data, I need to aggregate these dictionaries. Basically, I need to figure out a way to find all the occurrences of a particular key in all the dictionaries, and add them up to get a total count.
If I load more than one dictionary at a time, I run out of memory, which is the reason I had to split them in the first place. When I tried , I ran into performance issues. I am currently trying to store the values in a DB (mysql), processing multiple dictionaries at a time, since mysql provides row level locking, which is both good (since it means I can parallelize this operation) and bad (since it slows down the insert queries)
What are my options here? Is it a good idea to write a partially disk based dictionary so I can process the dicts one at a time? With an LRU replacement strategy? Is there something that I am completely oblivious to?
Thanks!
A disk-based dictionary-like object exists -- see the shelve module. Keys into a shelf must be strings, but you could simply use str on your tuples to obtain equivalent string keys; plus, I read your Q as meaning that you want only word as the key, so that's even easier (either str -- or, for vocabularies < 4GB, a struct.pack -- will be fine).
A good relational engine (especially PostgreSQL) would serve you well, but processing one dictionary at a time to aggregate each word's occurrences over all corpora into a shelf object should also be OK (not quite as fast, but simpler to code, since a shelf is so similar to a dict except for the type constraint on keys [[and a caveat for mutable values, but as your values are ints that need not concern you]]).
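A minimal sketch of the shelf-based aggregation, assuming each pickled file holds a (word, corpus) -> count dict. The two tiny sample dicts, the file names and the temporary directory are made up so that the example runs on its own.

```python
import os
import pickle
import shelve
import tempfile

workdir = tempfile.mkdtemp()

# Two small pickled dicts stand in for the real (word, corpus) -> count files.
parts = [{(1, 10): 3, (2, 10): 1}, {(1, 10): 4, (1, 11): 2}]
filenames = []
for n, part in enumerate(parts):
    fn = os.path.join(workdir, f"part{n}.pkl")
    with open(fn, "wb") as f:
        pickle.dump(part, f)
    filenames.append(fn)

# Aggregate one dictionary at a time into an on-disk shelf,
# so only one input dict is in memory at any moment.
with shelve.open(os.path.join(workdir, "totals")) as totals:
    for fn in filenames:
        with open(fn, "rb") as f:
            counts = pickle.load(f)
        for key, count in counts.items():
            skey = str(key)  # shelf keys must be strings
            totals[skey] = totals.get(skey, 0) + count
    total_1_10 = totals[str((1, 10))]
    n_keys = len(totals)
```

Each update is a disk read plus a write, so this trades speed for a small memory footprint.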
Something like this, if I understand your question correctly
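from collections import defaultdict
import pickle

# filenames: list of paths to the pickled (word, corpus) -> count dicts
result = defaultdict(int)
for fn in filenames:
    # pickle files must be opened in binary mode
    with open(fn, 'rb') as f:
        data_dict = pickle.load(f)
    for k, count in data_dict.items():
        result[k] += count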
If I understood your question correctly and you have integer ids for the words and corpora, then you can gain some performance by switching from a dict to a list, or even better, a numpy array. This may be annoying!
Basically, you need to replace the tuple with a single integer, which we can call the newid. You want all the newids to correspond to a word,corpus pair, so I would count the words in each corpus, and then have, for each corpus, a starting newid. The newid of (word,corpus) will then be word + start_newid[corpus].
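The newid mapping could be sketched like this. It assumes that words are numbered 0..n-1 within each corpus and that corpora are numbered from 0; the corpus sizes and sample counts are invented. A plain list is used here, and a numpy array would be filled the same way.

```python
from itertools import accumulate

# Assumed: words in each corpus are numbered 0..n-1; these sizes are made up.
words_per_corpus = [5, 3, 4]

# start_newid[c] is the first newid reserved for corpus c.
start_newid = [0] + list(accumulate(words_per_corpus))[:-1]

def newid(word, corpus):
    return word + start_newid[corpus]

# A flat list replaces the (word, corpus) -> count dict.
counts = [0] * sum(words_per_corpus)
for (word, corpus), count in {(0, 0): 2, (2, 1): 7, (3, 2): 1}.items():
    counts[newid(word, corpus)] += count
```

With these sizes the starting ids are 0, 5 and 8, so the pair (2, 1) lands in slot 7 and every pair gets a unique slot.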
If I misunderstood you and you don't have such ids, then I think this advice might still be useful, but you will have to manipulate your data to get it into the tuple of ints format.
Another thing you could try is rechunking the data.
Let's say that you can only hold 1.1 of these monsters in memory. Then, you can load one, and create a smaller dict or array that only corresponds to the first 10% of (word,corpus) pairs. You can scan through the loaded dict, and deal with any of the ones that are in the first 10%. When you are done, you can write the result back to disk, and do another pass for the second 10%. This will require 10 passes, but that might be OK for you.
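One way the multi-pass idea might look in code. Instead of taking "the first 10%" of pairs by position, this sketch assigns each pair to a pass with a simple modulo bucket, which is a substitution made here for brevity. The load_part function, the sample dicts and the bucket formula are all hypothetical.

```python
def load_part(part):
    """Stand-in for unpickling one large dict from disk."""
    return part

parts = [{(1, 10): 3, (2, 10): 1, (5, 11): 2}, {(1, 10): 4, (5, 11): 6}]
n_passes = 3  # the text uses 10; 3 keeps the example small

def bucket(key):
    # Assign each (word, corpus) pair to exactly one pass.
    word, corpus = key
    return (word * 31 + corpus) % n_passes

totals_by_pass = []
for current in range(n_passes):
    partial = {}  # only holds this pass's share of the keys
    for part in parts:
        for key, count in load_part(part).items():
            if bucket(key) == current:
                partial[key] = partial.get(key, 0) + count
    # In practice: write `partial` to disk here, then discard it.
    totals_by_pass.append(partial)
```

Every input file is read once per pass, so the cost is n_passes full scans in exchange for a result dict that is roughly 1/n_passes of the full size.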
If you chose your previous chunking based on what would fit in memory, then you will have to arbitrarily break your old dicts in half so that you can hold one in memory while also holding the result dict/array.