Python: What storage method to use for multi-property dictionary?

I'm relatively new to coding in Python. I know most of the syntax, but actually applying it within real programs is new to me. I wanted to know: if I have some item with multiple properties, how would I store it?
I am creating a Japanese learning tool, following a textbook, and I need a way to store, and later access, the vocabulary. For example...
If I have the word おはよう, its romanized form is "ohayou", its definition is "Good Morning", and the word is located in "Lesson 1" of the book.
I was thinking of creating a dictionary, with maybe a tuple/array/list as the value or key to store more properties per vocab word. Then I thought maybe I could use a class as well, but wouldn't I need an object for each vocab word? I just want to know what would be the most efficient and easy storage method for these vocab words and all their different English and Japanese properties.

This is pretty broad, but your guiding consideration should be that you need to provide hashable keys. Given your constraints, I think a tuple (or namedtuple) would fit well. A namedtuple operates like a record, or a lightweight class, so you get the benefit of dot-notation access while keeping an immutable data structure.
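For example, here is a minimal sketch; the field names, and the idea of keying a dict on the romanized form, are just one illustrative choice:

from collections import namedtuple

# Field names are illustrative; use whatever properties you track.
Vocab = namedtuple("Vocab", ["kana", "romaji", "definition", "lesson"])

ohayou = Vocab(kana="おはよう", romaji="ohayou",
               definition="Good Morning", lesson=1)

# A dict keyed on the romanized form gives fast lookup per word.
vocab = {v.romaji: v for v in [ohayou]}
print(vocab["ohayou"].definition)  # Good Morning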

Related

How to generate homophones on substring level?

I want to generate homophones of words programmatically. Meaning, words that sound similar to the original words.
I've come across the Soundex algorithm, but it just replaces some characters with other characters (like t instead of d). Are there any lists or algorithms that are a little bit more sophisticated, providing at least homophone substrings?
Important: I want to apply this on words that aren't in dictionaries, meaning that I can't rely on whole, real words.
EDIT:
The input is a string which is often a proper name and therefore in no standard (homophone) dictionary. An example could be Google or McDonald's (just to name two popular named entities, though many are far less well known).
The output is then a (random) homophone of this string. Since words often have more than one homophone, a single (random) one is my goal. In the case of Google, a homophone could be gugel, or MacDonald's for McDonald's.
How to do this well is a research topic. See for example http://www.inf.ufpr.br/didonet/articles/2014_FPSS.pdf.
But suppose that you want to roll your own.
The first step is figuring out how to turn the letters that you are given into a representation of what they sound like. This is a very hard problem that requires guessing. (E.g. what sound does "read" make? Depends on whether you are going to read, or you already read!) However, text to phonemes converter suggests that ARPAbet has solved this for English.
Next you'll want this to have been done for every word in a dictionary. Assuming that you can do that for one word, that's just a script.
Then you'll want it stored in a data structure where you can easily find similar sounds. That is in principle no different from the sort of algorithms used for spelling autocorrect, only with phonemes instead of letters. You can get a sense of how to do that with http://norvig.com/spell-correct.html. Or try to implement something like what is described in http://fastss.csg.uzh.ch/ifi-2007.02.pdf.
And that is it.
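To make steps two and three concrete, here is a rough sketch that assumes NLTK's copy of the CMU Pronouncing Dictionary (run nltk.download('cmudict') once first) and uses a naive linear scan rather than the FastSS-style index linked above. For out-of-dictionary names like the ones in the question, you would still need the grapheme-to-phoneme step first:

from difflib import SequenceMatcher
from nltk.corpus import cmudict

prons = cmudict.dict()  # word -> list of ARPAbet phoneme sequences

def sounds_like(word, cutoff=0.8):
    # Compare the phoneme sequence of `word` against every entry.
    target = prons[word.lower()][0]
    return [other for other, pron_list in prons.items()
            if any(SequenceMatcher(None, target, p).ratio() >= cutoff
                   for p in pron_list)]

print(sounds_like("google"))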

How to create a nested data structure in Python?

Since I recently started a new project, I'm stuck in the "think before you code" phase. I've always done basic coding, but I really think I now need to carefully plan how I should organize the results that are produced by my script.
It's essentially quite simple: I have a bunch of satellite data I'm extracting from Google Earth Engine, including different sensors, different acquisition modes, etc. What I would like to do is to loop through a list of "sensor-acquisition_mode" couples, request the data, do some more processing, and finally save it to a variable or file.
Suppose I have the following example:
sensors = ['landsat','sentinel1']
sentinel_modes = ['ASCENDING','DESCENDING']
sentinel_polarization = ['VV','VH']
In the end, I would like to have some sort of nested data structure that at the highest level has the elements 'landsat' and 'sentinel1'; under 'landsat' I would have a time and values matrix; under 'sentinel1' I would have the different modes and then as well the data matrices.
I've been thinking about lists, dictionaries or classes with attributes, but I really can't make up my mind, especially since I don't have that much experience.
At this stage, a little help in the right direction would be much appreciated!
Lists: Don't use lists for nested and complex data structures. You're just shooting yourself in the foot: code you write will be specialized to the exact format you are using, and any changes or additions will be brutal to implement.
Dictionaries: These aren't bad. They nest nicely, and you can use a dictionary whose values are dictionaries to hold named info about the keys. This is probably the easiest choice.
Classes: Classes are really useful for this if you need a lot of behavior to go with the data: you want the string representation to look a certain way, you want primitive operators to provide some functionality, or you just want to make the code slightly more readable or reusable.
From there, it's all your choice. If you want to go through the extra code (it's good for you) of writing them as classes, do it! Otherwise, dictionaries will get you where you need to go. Notably, the only thing a dictionary can't handle is two entries at the same key level with the same name (dicts don't allow duplicate keys).
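For the data in the question, a nested dict might look like the following sketch; the leaf lists are placeholders for the real time/value matrices:

sentinel_modes = ['ASCENDING', 'DESCENDING']
sentinel_polarization = ['VV', 'VH']

results = {
    'landsat': {'time': [], 'values': []},
    'sentinel1': {
        mode: {pol: {'time': [], 'values': []}
               for pol in sentinel_polarization}
        for mode in sentinel_modes
    },
}
results['sentinel1']['ASCENDING']['VV']['values'].append(0.42)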

Efficient persistent storage for lists in Python

I have a (key, value) map where for each key I have a somewhat large list of heterogeneous lists (at most about 250 items). Each list is a mix of strings and numbers that I might want to iterate over. The key is a string. If I wanted to store thousands of such (key, value) pairs persistently for efficient retrieval, what are the best options? If I use sqlite then I would need to create a table for each key and then map the lists to individual records in the database. Are there better and more efficient options if the goal is fast retrieval of the list of lists for a particular key?
Here is a short example. Say animals is a map of keys to list of lists. Sample data looks like this:
animals = {
    "Lion": [["Siberian", 203, "Tanzania", 123.56], ["Russian", 321, "Timbuktu", 23423.2]],
    "Tiger": [["White", 121, "Australia", 1211.1], ["Indian", 111, "India", 1241.5]],
}
So I want to be able to persist this data structure and be able to quickly index by the name of an animal (always unique) and get the list of lists for the particular animal I care about. If the lists within each animal's info are of fixed length with fixed fields, can I exploit that feature somehow to improve efficiency?
As Blender states in the comment, pickle is a reasonable choice. Make sure not to use the original version, though, and instead use the C-based cPickle. Alternatively, consider dill.
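A minimal sketch of that approach, reusing the sample data from the question; the import dance keeps it working on Python 2, where cPickle is the C implementation, and on Python 3, where pickle uses the C accelerator automatically:

try:
    import cPickle as pickle  # Python 2
except ImportError:
    import pickle  # Python 3

animals = {"Lion": [["Siberian", 203, "Tanzania", 123.56]]}

with open("animals.pkl", "wb") as f:
    pickle.dump(animals, f, protocol=pickle.HIGHEST_PROTOCOL)

with open("animals.pkl", "rb") as f:
    restored = pickle.load(f)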
I would suggest one of the fast JSON libraries. There are several speed comparisons online that suggest JSON can be as fast as, or even faster than, pickle. Check this one for example:
http://lvsl.github.io/2011/12/28/python-serialization-benchmark.html
and
https://blog.hartleybrody.com/python-serialize/
There are several JSON serialization alternatives, and again, there are some comparisons online, e.g.
https://medium.com/@jyotiska/json-vs-simplejson-vs-ujson-a115a63a9e26
I would suggest looking into ujson, which seems to be really fast and has one big advantage over e.g. pickle: it's very easy to inspect the data, since they are saved in a human-readable format. On the other hand, pickle will be a bit easier to use with custom types, although you can still define custom encoders for custom types for JSON. Overall, choose JSON if you care more about human readability, and pickle if what really matters is having a few lines of code less for custom types.
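A minimal sketch with the standard library's json module; if ujson is installed, swapping in import ujson as json should work too, since the dump/load calls used here share the same signatures:

import json

animals = {"Lion": [["Siberian", 203, "Tanzania", 123.56]]}

with open("animals.json", "w") as f:
    json.dump(animals, f)

with open("animals.json") as f:
    restored = json.load(f)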
Depending on your needs, you may want to consider Redis, which is an excellent key-value database solution. This tutorial provides a relatively quick introduction.
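A hedged sketch using the redis-py client with JSON-encoded values; it assumes a Redis server is running on localhost:6379:

import json
import redis

r = redis.Redis(host="localhost", port=6379, db=0)
r.set("Lion", json.dumps([["Siberian", 203, "Tanzania", 123.56]]))
lion = json.loads(r.get("Lion"))  # fast retrieval by animal name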

Python interval based sparse container

I am trying to create an interface between structured data and NLTK. NLP libraries generally work with bags of words, hence I need to turn my structured data into bags of words.
I need to associate the offset of a word with its metadata. Therefore my best bet is to have some sort of container that holds ranges as keys (allowing nested ranges) and can retrieve all the metadata (multiple entries if the word offset is part of a nested range).
What code can I pick up that would do this efficiently (i.e., with a sparse representation of the data)? Efficiency matters because my global corpus will be at least a few hundred megabytes.
Note :
I am serialising structured forum posts, which will include posts with sections of quotes in them. I want to know which topic a word belonged to, and whether it's a quote or user text. There will probably be additional metadata as my work progresses. Note that a word belonging to a quote is what I meant by nested metadata: the word is part of a quote, which belongs to a post made by a user.
I know that one can tag words in NLTK; I haven't looked into it. If it's possible to do what I want that way, please comment. But I am still looking for the original approach.
There is probably something in numpy that can solve my problem; looking at that now.
EDIT:
The input data is far too complex to rip out and post. I have found what I was looking for, though: http://packages.python.org/PyICL/. I needed to talk about intervals, not ranges :D I have used Boost extensively, but making it a dependency makes me a bit uneasy (sadly, I am having compiler errors with PyICL :( ).
The question now is: does anyone know an interval container library or data structure that can be used to index nested intervals in a sparse fashion? Or, put differently, one that provides similar semantics to boost.icl?
If you don't want to use PyICL or boost.icl, then instead of relying on a specialized library you could just use sqlite3 to do the job. If you use an in-memory database it will still be a few orders of magnitude slower than boost.icl (from experience coding other data structures vs sqlite3), but it should be more effective than a C++ std::vector-style approach on top of Python containers.
You can use two integer columns and put a date_type_low < offset < date_type_high predicate in your WHERE clause. Depending on your table structure, this will return nested/overlapping ranges.
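A minimal sketch of that scheme; the table and column names are illustrative, and the query returns every interval containing the offset, so a nested quote comes back alongside its enclosing post:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE spans (low INTEGER, high INTEGER, meta TEXT)")
conn.execute("CREATE INDEX spans_idx ON spans (low, high)")
conn.executemany("INSERT INTO spans VALUES (?, ?, ?)", [
    (0, 120, "post by user alice, topic 7"),
    (10, 45, "quote inside that post"),  # nested interval
])

offset = 20  # word offset to look up
for low, high, meta in conn.execute(
        "SELECT low, high, meta FROM spans WHERE low <= ? AND ? <= high",
        (offset, offset)):
    print(low, high, meta)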

Object vs. Dictionary: how to organise a data tree?

I am programming some kind of simulation with its data organised in a tree. The main object is World, which holds a bunch of methods and a list of City objects. Each City object in turn has a bunch of methods and a list of Population objects. Population objects have no methods of their own; they merely hold attributes.
My question regards the latter Population objects, which I can either derive from object or create as dictionaries. What is the most efficient way to organise these?
Here are few cases which illustrate my hesitation:
Saving the Data
I need to be able to save and load the simulation, for which purpose I use the built-in json module (I want the data to be human readable). Because the program is organised in a tree, saving data at each level can be cumbersome. In this case, the population is best kept as a dictionary appended to a population list as an attribute of a City instance. This way, saving is a mere matter of passing the City instance's __dict__ into json.
Using the Data
If I want to manipulate the population data, it is easier as a class instance than as a dictionary. Not only is the syntax simple, but I can also enjoy introspection features better while coding.
Performance
I am not sure, finally, as to what is the most efficient in terms of resources. An object and a dictionary have little difference in the end, since each object has a __dict__ attribute, which can be used to access all its attributes. If I run my simulation with large numbers of City and Population objects, which will use fewer resources: objects or dictionaries?
So again, what is the most efficient way to organise data in a tree? Are dictionaries or objects preferable? Or is there any secret to organising the data trees?
Why not a hybrid dict/object?
class Population(dict):
    def __getattr__(self, key):
        try:
            return self[key]
        except KeyError:
            # The attribute protocol expects AttributeError, not KeyError.
            raise AttributeError(key)

    def __setattr__(self, key, value):
        self[key] = value
Now you can easily access known names via attributes (foo.bar), while still having the dict functionality to easily access unknown names, iterate over them, etc. without the clunky getattr/setattr syntax.
If you want to always initialize them with particular fields, you can add an __init__ method:
    def __init__(self, starting=0, birthrate=100, imrate=10, emrate=10, deathrate=100):
        self.update(n=starting, b=birthrate, i=imrate, e=emrate, d=deathrate)
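A quick usage sketch: attribute and dict access interoperate, and since Population subclasses dict it serialises with json directly:

import json

p = Population(starting=1000)
p.n += 50                # attribute access via __getattr__/__setattr__
total = sum(p.values())  # ordinary dict behaviour
print(json.dumps(p))     # e.g. {"n": 1050, "b": 100, ...}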
As you've seen yourself, there is little practical difference. The main difference, in my opinion, is that individual, hard-coded attributes are slightly easier to use with objects (no need to quote the name), while dicts easily allow treating all values as one collection (e.g. summing them). This is why I'd go for objects, since the data of the population objects is likely heterogeneous and relatively independent.
I think you should consider using a namedtuple (see the Python docs on the collections module). You get to access the attributes of the Population object by name like you would with a normal class, e.g. population.attribute_name instead of population['attribute_name'] for a dictionary. Since you're not putting any methods on the Population class, this is all you need.
For your "saving data" criterion, there's also an _asdict method which returns a dictionary of field names to values that you could pass to json. (You might need to be careful about exactly what you get back from this method depending on which version of Python you're using. Some versions return a dictionary, and some return an OrderedDict. This might not make any difference for your purposes.)
namedtuples are also pretty lightweight, so they also satisfy your 'Performance' requirement. However, I'd echo other people's caution in saying not to worry about that; there's going to be very little difference unless you're doing some serious data-crunching.
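A minimal sketch of this approach, borrowing the illustrative field names from the __init__ shown above:

import json
from collections import namedtuple

Population = namedtuple("Population", ["n", "b", "i", "e", "d"])

pop = Population(n=1000, b=100, i=10, e=10, d=100)
print(pop.n)                      # attribute access, like a class
print(json.dumps(pop._asdict()))  # field-name dict, ready for saving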
I'd say that in every case a Population is a member of a City, and if it holds data only, why not use a dictionary?
Don't worry about performance, but if you really need to know: I think a dict is faster.
