I am programming some kind of simulation with its data organised in a tree. The main object is World which holds a bunch of methods and a list of City objects. Each City object in turn has a bunch of methods and a list of Population objects. Population objects have no method of their own, they merely hold attributes.
My question regards the latter Population objects, which I can either derive from object or create as dictionaries. What is the most efficient way to organise these?
Here are a few cases which illustrate my hesitation:
Saving the Data
I need to be able to save and load the simulation, for which I use the built-in json module (I want the data to be human readable). Because the program is organised as a tree, saving data at each level can be cumbersome. In this case, the population is best kept as a dictionary appended to a population list that is an attribute of a City instance. This way, saving is merely a matter of passing the City instance's __dict__ to json.
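To make the saving case concrete, here is a minimal sketch of a City whose populations are plain dictionaries. The class layout, the add_population helper and the attribute names are assumptions made for illustration, not the actual simulation code.

```python
import json


class City:
    """Hypothetical City that keeps its populations as plain dictionaries."""

    def __init__(self, name):
        self.name = name
        self.populations = []  # one dict per population

    def add_population(self, **attributes):
        self.populations.append(dict(attributes))


city = City("Springfield")
city.add_population(n=1000, birthrate=100, deathrate=90)
city.add_population(n=250, birthrate=80, deathrate=120)

# Every attribute is JSON-serialisable, so __dict__ can be dumped directly.
saved = json.dumps(city.__dict__, indent=2)

# Loading is the reverse: parse the text and restore the attributes
# without running __init__ again.
restored = City.__new__(City)
restored.__dict__.update(json.loads(saved))
```

This only works as long as everything reachable from __dict__ is a type json understands (dict, list, str, numbers, bool, None).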
Using the Data
If I want to manipulate the population data, it is easier as a class instance than as a dictionary. Not only is the syntax simple, but I can also enjoy introspection features better while coding.
Performance
I am not sure, finally, which is the most efficient in terms of resources. An object and a dictionary have little difference in the end, since each object has a __dict__ attribute, which can be used to access all its attributes. If I run my simulation with large numbers of City and Population objects, which will use fewer resources: objects or dictionaries?
So again, what is the most efficient way to organise data in a tree? Are dictionaries or objects preferable? Or is there any secret to organising the data trees?
Why not a hybrid dict/object?
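class Population(dict):
    def __getattr__(self, key):
        # Raise AttributeError (not KeyError) so hasattr() and getattr() defaults work
        try:
            return self[key]
        except KeyError:
            raise AttributeError(key)
    def __setattr__(self, key, value):
        self[key] = value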
Now you can easily access known names via attributes (foo.bar), while still having the dict functionality to easily access unknown names, iterate over them, etc. without the clunky getattr/setattr syntax.
If you want to always initialize them with particular fields, you can add an __init__ method:
    def __init__(self, starting=0, birthrate=100, imrate=10, emrate=10, deathrate=100):
        self.update(n=starting, b=birthrate, i=imrate, e=emrate, d=deathrate)
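Putting the pieces of this answer together, a usage sketch might look like the following. The translation of KeyError into AttributeError is an addition so that hasattr() behaves as expected, and the numbers are made up for illustration.

```python
import json


class Population(dict):
    """Hybrid dict/object: keys are also reachable as attributes."""

    def __init__(self, starting=0, birthrate=100, imrate=10, emrate=10, deathrate=100):
        super().__init__()
        self.update(n=starting, b=birthrate, i=imrate, e=emrate, d=deathrate)

    def __getattr__(self, key):
        # Only called when normal attribute lookup fails.
        try:
            return self[key]
        except KeyError:
            raise AttributeError(key)

    def __setattr__(self, key, value):
        self[key] = value


pop = Population(starting=500, birthrate=120)

# Attribute-style access for known names.
pop.n += pop.b - pop.d + pop.i - pop.e

# Dict-style access for treating the values as a collection.
total_rates = sum(pop[key] for key in ("b", "i", "e", "d"))

# It is still a dict, so json accepts it without any extra work.
as_json = json.dumps(pop)
```

One consequence of this design is that attribute names which collide with dict methods (for example "update" or "items") still have to be read with the pop["..."] syntax.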
As you've seen yourself, there is little practical difference - the main difference, in my opinion, is that using individual, hard-coded attributes is slightly easier with objects (no need to quote the name) while dicts easily allow treating all values as one collection (e.g. summing them). This is why I'd go for objects, since the data of the population objects is likely heterogeneous and relatively independent.
I think you should consider using a namedtuple (see the Python docs on the collections module). You get to access the attributes of the Population object by name like you would with a normal class, e.g. population.attribute_name instead of population['attribute_name'] for a dictionary. Since you're not putting any methods on the Population class this is all you need.
For your "saving data" criterion, there's also an _asdict method which returns a dictionary of field names to values that you could pass to json. (You might need to be careful about exactly what you get back from this method depending on which version of Python you're using. Some versions return a dictionary, and some return an OrderedDict. This might not make any difference for your purposes.)
namedtuples are also pretty lightweight, so they also work with your 'Running the Simulation' resource requirement. However, I'd echo other people's caution and say not to worry about that: there's going to be very little difference unless you're doing some serious data-crunching.
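A short sketch of the namedtuple approach, including the _asdict round trip through json, could look like this. The field names are assumptions, since the question does not list the actual attributes.

```python
import json
from collections import namedtuple

# Field names are illustrative only.
Population = namedtuple("Population", ["n", "birthrate", "deathrate"])

pop = Population(n=1000, birthrate=100, deathrate=90)
growth = pop.birthrate - pop.deathrate  # attribute access by name

# namedtuples are immutable, so "changing" a field builds a new tuple.
pop = pop._replace(n=pop.n + growth)

# _asdict() maps field names to values, which json can serialise.
as_json = json.dumps(pop._asdict())
restored = Population(**json.loads(as_json))
```

The immutability is the main trade-off: if the simulation updates populations on every step, each update creates a new tuple rather than modifying one in place.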
I'd say that in every case a Population is a member of a City, and if it's data only, why not use a dictionary?
Don't worry about performance, but if you really need to know, I think a dict is faster.
Related
I'm relatively new to coding in python. I know most of the syntax, but actually applying it within real programs is new to me. I wanted to know, if I have some item with multiple properties, how would I store that?
I am creating a Japanese learning tool, following a textbook, and I need a way to store, and later access the vocabulary. For example...
If I have the word おはよう, this in romanized type is "ohayou", and its definition is "Good Morning", also this vocab is located in "Lesson 1" of the book.
I was thinking of creating a dictionary, with maybe a tuple/array/list for the value, or key to store more properties per vocab word. Then I thought maybe I could use a class as well, but thought I would need a class for each vocab word as objects? I just want to know what would be the most efficient, and easy storage method for these vocab words and all their different English, and Japanese properties.
This is pretty broad, but your guiding consideration should be that you need to provide hashable keys. Given your constraints, I think a tuple (or namedtuple) would fit well. Namedtuple operates like a record, or lightweight class, so you can get the benefits of calling with dot notation while having an immutable data structure.
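As a rough sketch of that suggestion, the vocabulary could be a dictionary keyed by the Japanese word, with a namedtuple as the record. The names Vocab, vocabulary and add_word are hypothetical, and the second entry is made up for illustration.

```python
from collections import namedtuple

# One record type for all words; no need for a class per word.
Vocab = namedtuple("Vocab", ["kana", "romaji", "meaning", "lesson"])

vocabulary = {}  # keyed by the Japanese word, which is a hashable str


def add_word(kana, romaji, meaning, lesson):
    vocabulary[kana] = Vocab(kana, romaji, meaning, lesson)


add_word("おはよう", "ohayou", "Good Morning", 1)
add_word("ありがとう", "arigatou", "Thank you", 1)  # illustrative extra entry

entry = vocabulary["おはよう"]  # direct lookup by word
lesson_one = [v.kana for v in vocabulary.values() if v.lesson == 1]
```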
Since I recently started a new project, I'm stuck in the "think before you code" phase. I've always done basic coding, but I really think I now need to carefully plan how I should organize the results that are produced by my script.
It's essentially quite simple: I have a bunch of satellite data I'm extracting from Google Earth Engine, including different sensors, different acquisition modes, etc. What I would like to do is to loop through a list of "sensor-acquisition_mode" couples, request the data, do some more processing, and finally save it to a variable or file.
Suppose I have the following example:
sensors = ['landsat','sentinel1']
sentinel_modes = ['ASCENDING','DESCENDING']
sentinel_polarization = ['VV','VH']
In the end, I would like to have some sort of nested data structure that at the highest level has the elements 'landsat' and 'sentinel1'; under 'landsat' I would have a time and values matrix; under 'sentinel1' I would have the different modes and then as well the data matrices.
I've been thinking about lists, dictionaries or classes with attributes, but I really can't make up my mind, especially since I don't have that much experience.
At this stage, a little help in the right direction would be much appreciated!
Lists: Don't use lists for nested and complex data structures. You're just shooting yourself in the foot- code you write will be specialized to the exact format you are using, and any changes or additions will be brutal to implement.
Dictionaries: Aren't bad- they'll nest nicely and you can use a dictionary whose value is a dictionary to hold named info about the keys. This is probably the easiest choice.
Classes: Classes are really really useful for this if you need a lot of behavior to go with them - you want the string of it to be represented a certain way, you want to be able to use primitive operators for some functionality, or you just want to make the code slightly more readable or reusable.
From there, it's all your choice- if you want to go through the extra code (it's good for you) of writing them as classes, do it! Otherwise, dictionaries will get you where you need to go. Notably the only thing a dictionary couldn't do would be if you have two things that should be at the key level in the dictionary with the same name (Dicts don't do repetition).
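For the dictionary option, a minimal sketch of the nested structure described in the question might be the following. The empty_series helper, the 'time' and 'values' keys, and the sample date and value are assumptions standing in for the real data matrices.

```python
sensors = ['landsat', 'sentinel1']
sentinel_modes = ['ASCENDING', 'DESCENDING']
sentinel_polarization = ['VV', 'VH']


def empty_series():
    # Placeholder for the "time and values matrix" from the question.
    return {'time': [], 'values': []}


results = {}
for sensor in sensors:
    if sensor == 'sentinel1':
        # sentinel1 gets one extra level per mode and per polarization.
        results[sensor] = {
            mode: {pol: empty_series() for pol in sentinel_polarization}
            for mode in sentinel_modes
        }
    else:
        results[sensor] = empty_series()

# Hypothetical values, only to show where a result would be stored.
results['sentinel1']['ASCENDING']['VV']['time'].append('2020-01-01')
results['sentinel1']['ASCENDING']['VV']['values'].append(-12.5)
```

Because the structure contains only dictionaries, lists, strings and numbers, it can also be written out with json without further conversion.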
I have a (key, value) map where for each key I have a somewhat large list of heterogeneous lists (~max about 250 items). Each list is a mix of strings and numbers that I might want to iterate over. The key is a string. If I wanted to store such a list with thousands of such (key, value) pairs persistently for efficient retrieval what are the best options? If I use sqlite then I would need to create a table for each key and then map the lists to individual records in the database. Are there better and efficient options if the goal is fast retrieval of the list of lists for a particular key?
Here is a short example. Say animals is a map of keys to list of lists. Sample data looks like this:
animals = {
    "Lion": [["Siberian", 203, "Tanzania", 123.56], ["Russian", 321, "Timbktu", 23423.2]],
    "Tiger": [["White", 121, "Australia", 1211.1], ["Indian", 111, "India", 1241.5]]
}
So I want to be able to persist this data structure and be able to quickly index by the name of an animal (always unique) and get the list of lists for the particular animal I care about. If the lists within each animal's info is of fixed length and fixed fields, can I exploit that feature somehow to improve efficiency?
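As Blender states in the comment, pickle is a reasonable choice. On Python 2, make sure not to use the original pure-Python version, though, and instead use the C-based cPickle; on Python 3, the standard pickle module already uses the C implementation when it is available. Alternatively, consider dill.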
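A minimal sketch of the pickle route, using the sample data from the question and a throwaway temporary file (the file name is an assumption), could be:

```python
import os
import pickle
import tempfile

animals = {
    "Lion": [["Siberian", 203, "Tanzania", 123.56], ["Russian", 321, "Timbktu", 23423.2]],
    "Tiger": [["White", 121, "Australia", 1211.1], ["Indian", 111, "India", 1241.5]],
}

# Throwaway location for the example.
path = os.path.join(tempfile.mkdtemp(), "animals.pkl")

with open(path, "wb") as fh:
    pickle.dump(animals, fh, protocol=pickle.HIGHEST_PROTOCOL)

with open(path, "rb") as fh:
    loaded = pickle.load(fh)

tiger_rows = loaded["Tiger"]  # plain dict lookup once loaded
```

Note that this reads the whole map back into memory before any key can be looked up, which is fine for a few thousand keys but is not per-key retrieval from disk.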
I would suggest one of the fast JSON libraries. There are several speed comparisons online that suggest that JSON can be as fast as, or even faster than, pickle. Check this one for example:
http://lvsl.github.io/2011/12/28/python-serialization-benchmark.html
and
https://blog.hartleybrody.com/python-serialize/
There are several JSON serialization alternatives, and again, there are some comparisons online, e.g.
https://medium.com/#jyotiska/json-vs-simplejson-vs-ujson-a115a63a9e26
I would suggest looking into ujson, which seems to be really fast and has one big advantage over e.g. pickle, it's very easy to inspect the data as they are saved in a human readable format. On the other hand pickle will be a bit easier to use with custom types, although you can still define custom encoders for custom types for JSON. Overall, choose JSON if you care more about human readability, and pickle if what really matters is having a few lines of code less for custom types.
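To illustrate the point about custom types, here is a sketch of a custom encoder using the standard library json module as a stand-in for ujson. The Animal class and its field names are hypothetical; the question only has anonymous lists.

```python
import json


class Animal:
    """Hypothetical custom type; the field names are invented for the example."""

    def __init__(self, variety, count, location, weight):
        self.variety = variety
        self.count = count
        self.location = location
        self.weight = weight


def encode_animal(obj):
    # json calls this hook for any object it cannot serialise itself.
    if isinstance(obj, Animal):
        return vars(obj)
    raise TypeError(f"Cannot serialise {type(obj).__name__}")


animals = {"Lion": [Animal("Siberian", 203, "Tanzania", 123.56)]}

text = json.dumps(animals, default=encode_animal, indent=2)

# Decoding needs the reverse step written by hand.
loaded = {
    name: [Animal(**fields) for fields in rows]
    for name, rows in json.loads(text).items()
}
```

The encode and decode helpers are the "few lines of code" that pickle would save; in exchange, the contents of text can be read and edited by hand.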
Depending on your needs, you may want to consider REDIS which is an excellent key:value database solution. This tutorial provides a relatively quick introduction.
Apologies if this is a duplicate; I tried searching.
I understand that everything in python is a data type, but this is what I'm a bit confused about. So everything is an object, we have the collection class, integer, float, and other classes as children of the parent object class thinking of it as a tree of objects with say lists, tuples, and dictionaries as a child to the collection class.
So when we arbitrarily create our own data type class and define the specific types of children / data types it will be able to handle, are we just redefining what Python already has implemented?
For example, say we create our own data class to handle list using the super() method and then specify that only integers could be placed into this list or strings, or whatever we may wish to be contained within this data type.
Sorry if this question is a bit confusing; I tried to word it as precisely as I could.
So when we arbitrarily create our own data type class and defining the
specific types of children / data types it will be able to handle, are
we just redefining what python already has implemented?
This just has one short answer: yes.
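For illustration, the integer-only list described in the question might be sketched like this. IntList is a hypothetical name, and the sketch only guards the most common ways of adding items.

```python
class IntList(list):
    """Hypothetical list subclass that only accepts integers."""

    def __init__(self, iterable=()):
        # Validate everything first, then hand the items to list.
        super().__init__([self._check(item) for item in iterable])

    @staticmethod
    def _check(item):
        # bool is a subclass of int, so it is rejected explicitly.
        if not isinstance(item, int) or isinstance(item, bool):
            raise TypeError(f"IntList only holds int, got {type(item).__name__}")
        return item

    def append(self, item):
        super().append(self._check(item))

    def extend(self, iterable):
        super().extend([self._check(item) for item in iterable])

    def insert(self, index, item):
        super().insert(index, self._check(item))

    def __setitem__(self, index, value):
        if isinstance(index, slice):
            value = [self._check(item) for item in value]
        else:
            self._check(value)
        super().__setitem__(index, value)


numbers = IntList([1, 2, 3])
numbers.append(4)
try:
    numbers.append("five")
except TypeError as exc:
    error = str(exc)
```

Other mutation paths, such as the += operator, are not covered here and would need the same treatment in a complete version.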
I am building a flexible, lightweight, in-memory database in Python, and discovered a performance problem with the way I was looking up values and using indexes. In an effort to improve this I've tried a few options, trying to balance speed with memory usage. My current implementation uses a dict of dicts to store data by record (object reference) and field (also an object reference). So for example, if I have three records with three fields, where some of the data is missing (i.e. NULL values):
{<Record1>: {<Field1>: 4, <Field2>: 'value', <Field3>: <Other Record>},
 <Record2>: {<Field1>: 4, <Field2>: 'value'},
 <Record3>: {<Field1>: 5}}
I considered a numpy array, but I would still need two dictionaries to map object instances to array indexes, so I can't see that it would perform any better.
Indexes are implemented using a pair of bisected lists, essentially acting as a map from value to record instance. For example, an index on the above <Field1>:
[[4, 4, 5], [<Record1>, <Record2>, <Record3>]]
I was previously using a simple dict of bins, but this didn't allow range lookups (e.g. all values > 5) (see Python hash table for fuzzy matching).
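A pair of bisected lists like the one described above can be sketched with the standard bisect module. This is not the code from the linked project; the SortedIndex class and its method names are assumptions, and strings stand in for record instances.

```python
import bisect


class SortedIndex:
    """Sketch of an index kept as two parallel lists sorted by value."""

    def __init__(self):
        self.values = []   # sorted field values
        self.records = []  # records[i] is the record holding values[i]

    def add(self, value, record):
        # Insert after any equal values so insertion order is kept.
        pos = bisect.bisect_right(self.values, value)
        self.values.insert(pos, value)
        self.records.insert(pos, record)

    def equal_to(self, value):
        lo = bisect.bisect_left(self.values, value)
        hi = bisect.bisect_right(self.values, value)
        return self.records[lo:hi]

    def greater_than(self, value):
        # Range lookup: everything to the right of the last equal value.
        return self.records[bisect.bisect_right(self.values, value):]


index = SortedIndex()
for value, record in [(5, "Record3"), (4, "Record1"), (4, "Record2")]:
    index.add(value, record)
```

Lookups are O(log n) thanks to the binary search, while each insert is O(n) because the list elements after the insertion point have to be shifted.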
My question is this. I am concerned that I have several object references, and multiple copies of the same values in the indexes. Do all these duplicate references actually use more memory, or are references cheap in python? My alternative is to try to associate a numerical key to each object, which might improve things at least up to 256, but I don't know enough about how python handles references to know if this would really be any better.
Does anyone have any suggestions of a better way to manage this?
Reimplementing the critical parts in C is an option I want to keep as a last resort.
For anyone interested, my code is here.
Edit 1:
The question, simply put, is which of the following is more efficient in terms of memory usage, where a is an object instance and i is an integer:
[a] * 1000
Or
[i] * 1000, {a: i}
Edit 2:
Because of the large number of comments suggesting I use an existing system, here are my requirements. If anyone can suggest a system which fulfills all of these, that would be great, but so far I have not found anything which does. Otherwise, my original question still relates to memory usage of references in Python:
Must be light-weight and in-memory. Definitely not a client/server model.
Need to be able to easily alter tables, change fields, change rules, etc, on the fly.
Need to easily apply very complex validation rules. SQL doesn't meet this requirement. Although it is sometimes possible to build up very complicated statements, it is far from easy.
Need to support joins and associations between tables. Many NoSQL databases don't support joins at all, or at most only simple joins.
Need to support a method of loading and storing data to any file format. I am currently implementing this by providing a framework which makes it easy to add new formats as needed.
It does not need persistence (beyond storing data as in the previous point), and does not need to handle massive amounts of data, i.e. not more than a couple of million records. Typically, I am dealing with a few thousand.
Each reference is in effect a pointer, each pointer requires a small amount of memory.
You can use memory profiler to view memory use on a line by line basis. In this way you can see what happens when you make a reference.
Python does not specify a particular implementation for dynamic memory management, but from the semantics of the language one can assume that a reference uses memory similar to a C pointer.
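The two layouts from Edit 1 can be compared directly with sys.getsizeof. This is a sketch that assumes CPython, where a list stores one pointer per slot; the Record class is a stand-in for an arbitrary object, and other implementations may store lists of integers differently.

```python
import sys


class Record:
    """Stand-in for an arbitrary object instance."""


a = Record()
i = 12345

refs_to_object = [a] * 1000
refs_to_int = [i] * 1000

# On CPython each slot holds a pointer, whatever it points to,
# so both lists occupy the same number of bytes.
size_object_list = sys.getsizeof(refs_to_object)
size_int_list = sys.getsizeof(refs_to_int)

# The second layout additionally pays for the {a: i} mapping.
size_mapping = sys.getsizeof({a: i})

print(size_object_list, size_int_list, size_mapping)
```

Under that assumption, replacing object references with integer keys does not shrink the lists themselves; it only adds the cost of the mapping and an extra lookup.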
FWIW, I ran some tests on a 100x100 structure, testing a sparsely populated dictionary structure, a fully populated dictionary structure, a list, and a numpy array. The latter two had a dictionary mapping object references to indexes. I timed getting every item in the structure by index (returning a sentinel for missing data in the sparse dict), and also reported the total size. My results were somewhat surprising:
Structure      Time      Size
=============  ========  =====
full dict      0.0236s    6284
list           0.0426s   13028
sparse dict    0.1079s    1676
array          0.2262s   12608
So the fastest and second smallest was a full dict, presumably because there was no need to run a "key in dict" check on it.