Since I recently started a new project, I'm stuck in the "think before you code" phase. I've always done basic coding, but I really think I now need to carefully plan how to organize the results produced by my script.
It's essentially quite simple: I have a bunch of satellite data I'm extracting from Google Earth Engine, covering different sensors, different acquisition modes, etc. What I would like to do is loop through a list of "sensor-acquisition_mode" pairs, request the data, do some more processing, and finally save it to a variable or file.
Suppose I have the following example:
sensors = ['landsat', 'sentinel1']
sentinel_modes = ['ASCENDING', 'DESCENDING']
sentinel_polarization = ['VV', 'VH']
In the end, I would like to have some sort of nested data structure: at the highest level, the elements 'landsat' and 'sentinel1'; under 'landsat', a time and values matrix; under 'sentinel1', the different modes and then the data matrices as well.
I've been thinking about lists, dictionaries, or classes with attributes, but I really can't make up my mind, also since I don't have that much experience.
At this stage, a little help in the right direction would be much appreciated!
Lists: Don't use lists for nested and complex data structures. You're just shooting yourself in the foot: code you write will be specialized to the exact format you are using, and any changes or additions will be brutal to implement.
Dictionaries: Aren't bad. They nest nicely, and you can use a dictionary whose value is a dictionary to hold named info about the keys. This is probably the easiest choice.
Classes: Classes are really useful if you need a lot of behavior to go with the data: you want its string representation to look a certain way, you want to overload operators for some functionality, or you just want to make the code slightly more readable or reusable.
From there, it's all your choice: if you want to go through the extra code (it's good for you) of writing them as classes, do it! Otherwise, dictionaries will get you where you need to go, as in the sketch below. Notably, the only thing a dictionary can't do is hold two entries with the same name at the same key level (dicts don't allow duplicate keys).
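For instance, here's a minimal sketch of the nested-dict route for the satellite example above (the arrays are placeholders for whatever your Earth Engine processing actually produces):

import numpy as np

results = {
    'landsat': {
        'time': np.array([1, 2, 3]),       # acquisition times
        'values': np.random.rand(3, 4),    # data matrix
    },
    'sentinel1': {
        mode: {
            pol: {'time': np.array([1, 2]), 'values': np.random.rand(2, 4)}
            for pol in ['VV', 'VH']
        }
        for mode in ['ASCENDING', 'DESCENDING']
    },
}

# Access is then a chain of keys:
landsat_values = results['landsat']['values']
s1_asc_vv = results['sentinel1']['ASCENDING']['VV']['values']

If you later find yourself attaching behavior to these structures (custom printing, validation, merging), that's the point where wrapping them in a small class starts to pay off.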
Related
I'm trying to figure out whether I should make my output a list or a dictionary. Which one would be easier to verify when you are trying to match 3 different outputs with millions of lines?
I'm going to take two different outputs and verify three pieces for each line.
Problem is, later I'll have an output C which I want to verify against outputs A and B. I'm leaning towards a dictionary.
For example (bold marks what I'll be matching):
output A:
10.1.1.1:80 10.2.1.1:81 10.1.1.1:80 10.3.1.1:81 etc etc etc name
...
...
...
output B:
name etc etc etc etc 10.1.1.1/16 10.2.1.1/16
...
...
...
The key difference between a list and a dictionary is the way you access your data: a dictionary is accessed by keys, whereas a list is accessed by a sequence of indexes. Which one should you use here? This is sometimes a performance issue and sometimes a matter of code clarity. I don't think the latter applies to your case; performance is your greatest concern here, because you're dealing with millions of lines to be processed and matched.

There have been many internal optimizations of Python objects, so it's better to focus on what you're trying to achieve instead of on the objects themselves. I could very easily recommend dictionaries, but would that be a good recommendation for later developments in your code? I don't know; unless we work on that specific code, any recommendation would be vague.
If you care about performance, this might help:
Python: List vs Dict for look up table
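To make the difference concrete, here is a rough sketch of the dictionary approach for this kind of matching. The filenames and the extract_key logic are assumptions, since the real line formats aren't specified:

# Index one output by its match key, then verify the other outputs with
# O(1) dict lookups instead of rescanning a list for every line.
def extract_key(line):
    # Assumption: the first field is the piece being matched.
    return line.split()[0]

index_a = {}
with open('output_a.txt') as fa:
    for line in fa:
        index_a[extract_key(line)] = line

with open('output_b.txt') as fb:
    for line in fb:
        key = extract_key(line)
        if key in index_a:      # constant-time membership test
            pass                # compare the remaining pieces here
        else:
            print('no match for', key)

The same index can later be reused to verify output C against A and B without another pass over A.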
I have a (key, value) map where for each key I have a somewhat large list of heterogeneous lists (max about 250 items). Each list is a mix of strings and numbers that I might want to iterate over. The key is a string. If I wanted to persistently store such a map with thousands of (key, value) pairs for efficient retrieval, what are the best options? If I use sqlite, then I would need to create a table for each key and map the lists to individual records in the database. Are there better, more efficient options if the goal is fast retrieval of the list of lists for a particular key?
Here is a short example. Say animals is a map of keys to list of lists. Sample data looks like this:
animals = {
    "Lion": [["Siberian", 203, "Tanzania", 123.56], ["Russian", 321, "Timbuktu", 23423.2]],
    "Tiger": [["White", 121, "Australia", 1211.1], ["Indian", 111, "India", 1241.5]]
}
So I want to be able to persist this data structure and quickly index by the name of an animal (always unique) to get the list of lists for the particular animal I care about. If the lists within each animal's info are of fixed length and fixed fields, can I exploit that feature somehow to improve efficiency?
As Blender states in the comment, pickle is a reasonable choice. Make sure not to use the pure-Python version on Python 2, though; use the C-based cPickle instead (on Python 3, pickle uses the C implementation automatically). Alternatively, consider dill.
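A minimal sketch of the pickle route for the animals example (Python 3 shown; the filename is arbitrary):

import pickle  # on Python 2: import cPickle as pickle

animals = {
    "Lion": [["Siberian", 203, "Tanzania", 123.56]],
    "Tiger": [["White", 121, "Australia", 1211.1]],
}

# Persist the whole mapping to disk...
with open('animals.pkl', 'wb') as f:
    pickle.dump(animals, f, protocol=pickle.HIGHEST_PROTOCOL)

# ...and load it back; lookup by animal name is then a plain dict access.
with open('animals.pkl', 'rb') as f:
    restored = pickle.load(f)
print(restored['Lion'])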
I would suggest one of the fast JSON libraries. There are several speed comparisons online that suggest JSON can be as fast as, or even faster than, pickle. Check these, for example:
http://lvsl.github.io/2011/12/28/python-serialization-benchmark.html
and
https://blog.hartleybrody.com/python-serialize/
There are several JSON serialization alternatives, and again, there are some comparisons online, e.g.
https://medium.com/@jyotiska/json-vs-simplejson-vs-ujson-a115a63a9e26
I would suggest looking into ujson, which seems to be really fast and has one big advantage over, e.g., pickle: the data are very easy to inspect, since they are saved in a human-readable format. On the other hand, pickle is a bit easier to use with custom types, although you can still define custom encoders for custom types with JSON. Overall, choose JSON if you care more about human readability, and pickle if what really matters is writing a few lines of code less for custom types.
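To make that concrete, a minimal sketch with ujson (json and simplejson expose the same dump/load API, so they're drop-in alternatives):

import ujson  # third-party: pip install ujson

animals = {
    "Lion": [["Siberian", 203, "Tanzania", 123.56]],
    "Tiger": [["White", 121, "Australia", 1211.1]],
}

# Serialize to a human-readable file...
with open('animals.json', 'w') as f:
    ujson.dump(animals, f)

# ...and read it back; retrieval by animal name is a plain dict lookup.
with open('animals.json') as f:
    restored = ujson.load(f)
print(restored['Tiger'])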
Depending on your needs, you may also want to consider Redis, which is an excellent key-value database. This tutorial provides a relatively quick introduction.
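For completeness, a small sketch with the redis-py client (this assumes a Redis server running on localhost, and JSON-encodes each list of lists, since Redis values are plain strings/bytes):

import json
import redis  # third-party: pip install redis

r = redis.Redis(host='localhost', port=6379, db=0)

# Store each animal's list of lists as a JSON string under its name...
r.set('Lion', json.dumps([["Siberian", 203, "Tanzania", 123.56]]))

# ...and fetch it back by key.
lion = json.loads(r.get('Lion'))
print(lion)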
I am creating a python module that creates and operates on data structures to store lots of semantically tagged data and metadata from real experiments. So in an experiment you have:
subjects
treatments
replicates
Enclosing these 3 categories is the experiment, and combinations of the three categories are what I am calling "units". Now, there is no inherently correct hierarchy between the three (they are table-like), but for certain analyses it is useful to think of a certain permutation of the three as a hierarchy,
e.g. (subjects-->(treatments-->(replicates)))
or
(replicates-->(treatments-->(subjects)))
Moreover, when collecting data, files will be copy-pasted into a folder on a desktop, so data is at least coming in as a tree. I have thought a lot about which hierarchy is "better", but I keep coming up with use cases for most of the 6 possible permutations. I want my module to be flexible, so that the user can think of the experiment or collect the data using whatever hierarchy, table, or hierarchy-table hybrid makes sense to them.
Also, the "units" (or table entries) are containers for arbitrary amounts of data (bytes to gigabytes, ideally) of any organizational complexity. This is why I didn't think a relational database approach was really the way to go, and a NoSQL-type solution makes more sense. But then I have the problem of how to order the three categories if none is "correct".
So my question is: what is this multifaceted data structure?
Does some sort of fluid data structure or set of algorithms exist to easily inter-convert or produce structured views?
The short answer is that HDF5 addresses these fairly common concerns and I would suggest it. http://www.hdfgroup.org/HDF5/
In Python, http://docs.h5py.org/en/latest/high/group.html and http://odo.pydata.org/en/latest/hdf5.html will help.
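As a quick taste, here is a minimal h5py sketch; the group layout below is just one of the six possible permutations, and all names are illustrative:

import h5py
import numpy as np

# One permutation of the hierarchy expressed as nested HDF5 groups;
# intermediate groups are created automatically from the path.
with h5py.File('experiment.h5', 'w') as f:
    grp = f.create_group('subjects/subj01/treatments/drugA/replicates/rep1')
    grp.create_dataset('measurements', data=np.random.rand(100))
    grp.attrs['collected'] = '2014-01-01'   # arbitrary metadata as attributes

# Reading back: groups are addressed like filesystem paths.
with h5py.File('experiment.h5', 'r') as f:
    data = f['subjects/subj01/treatments/drugA/replicates/rep1/measurements'][:]

Since a group path is just a string, swapping the permutation is a matter of joining the category names in a different order.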
I am trying to create an interface between structured data and NLTK. NLP libraries generally work with bags of words, hence I need to turn my structured data into bags of words.
I need to associate the offset of a word with its metadata. Therefore my best bet is to have some sort of container that holds ranges as keys (allowing nested ranges) and can retrieve all the metadata (multiple entries if the word offset is part of a nested range).
What code can I pick up that would do this efficiently (i.e., a sparse representation of the data)? Efficiency matters because my global corpus will be at least a few hundred megabytes.
Note :
I am serialising structured forum posts, which will include posts with sections of quotes within them. I want to know which topic a word belonged to, and whether it's a quote or user text. There will probably be additional metadata as my work progresses. Note that a word belonging to a quote is what I meant by nested metadata: the word is part of a quote, which belongs to a post made by a user.
I know that one can tag words in NLTK, but I haven't looked into it; if it's possible to do what I want that way, please comment. I am still looking for the original approach, though.
There is probably something in NumPy that can solve my problem; looking at that now.
Edit:
The input data is far too complex to rip out and post. I have found what I was looking for, though: http://packages.python.org/PyICL/. I needed to talk about intervals and not ranges :D I have used Boost extensively, but making it a dependency makes me a bit uneasy (sadly, I am having compiler errors with PyICL :( ).
The question now is: does anyone know an interval container library or data structure that can be used to index nested intervals in a sparse fashion? Or, put differently, one that provides similar semantics to boost.icl?
If you don't want to use PyICL or boost.icl, then instead of relying on a specialized library you could just use sqlite3 to do the job. If you use an in-memory database, it will still be a few orders of magnitude slower than boost.icl (from experience coding other data structures vs. sqlite3), but it should be more effective than a C++ std::vector-style approach on top of Python containers.
You can use two integers and a date_type_low < offset < date_type_high predicate in your WHERE clause. Depending on your table structure, this will return nested/overlapping ranges, as in the sketch below.
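A minimal sketch of that approach with the standard-library sqlite3 module (the table and column names are illustrative, not from the original post):

import sqlite3

# Illustrative schema: each row is one metadata interval over word offsets.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE intervals (low INTEGER, high INTEGER, meta TEXT)')
conn.execute('CREATE INDEX idx_low_high ON intervals (low, high)')

# A post spanning offsets 0-99, containing a quote at offsets 10-25.
conn.execute("INSERT INTO intervals VALUES (0, 99, 'post:user123')")
conn.execute("INSERT INTO intervals VALUES (10, 25, 'quote:topicX')")

# All metadata covering word offset 15 -- both nested intervals come back.
offset = 15
rows = conn.execute(
    'SELECT meta FROM intervals WHERE low <= ? AND ? <= high',
    (offset, offset),
).fetchall()
print(rows)  # [('post:user123',), ('quote:topicX',)]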
I have this somewhat big data structure that stores pairs of data. The individual data is tiny and readily hashable, and there are something like a few hundred thousand data points in there.
At first, this was a simple dict that was accessed only by keys. Later on, however, I discovered that I also needed to access it by value, that is, get the key for a certain value. Since this was done somewhat less often (~1/10) than access by key, I naïvely implemented it by simply iterating over all the dict's items(), which proved a bit sluggish at a few hundred thousand calls per second: it is about 500 times slower than access by key.
So my next idea was to just save the reverse dict, too. This seems a rather inelegant solution, however, so I turn to you guys for help.
Do you know any data structure in Python that stores data pairs that can be accessed by either data point of the pair?
You could try bidict.
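A minimal sketch (bidict is a third-party package; note that it requires values to be unique as well as keys):

from bidict import bidict  # third-party: pip install bidict

pairs = bidict({'alpha': 1, 'beta': 2})

print(pairs['alpha'])     # forward lookup by key   -> 1
print(pairs.inverse[2])   # reverse lookup by value -> 'beta'

If your values aren't unique, the two-dict approach you called inelegant (updating both mappings on every write) is the usual fallback, and it only costs a constant factor in memory.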