I am brand-new to Python, having decided to make the jump from Matlab. I have tried to find the answer to my question for days but without success!
The problem: I have a bunch of objects with certain attributes. Note that I am not talking about objects and attributes in the programming sense of the word - I am talking about literal astronomical objects about which I have different types of numerical data and physical attributes.
In a loop in my script, I go through each source/object in my catalogue, do some calculations, and stick the results in a huge dictionary. The form of the script is like this:
for i in range(len(ObjectCatalogue)):
    # calculate quantity1 for source i
    # calculate quantity2 for source i
    # determine attribute1 for source i
    sourceDataDict[i].update({'spectrum': quantity1})
    sourceDataDict[i].update({'peakflux': quantity2})
    sourceDataDict[i].update({'morphology': attribute1})
So once I have gone through a hundred sources or so, I can, say, access the spectrum for object no. 20 with spectrumSource20 = sourceDataDict[20]['spectrum'] etc.
What I want to do is be able to select all objects in the dictionary based on the value of the keyword 'morphology', say. Say the keyword 'morphology' can take on the values 'simple' or 'complex'. Is there any way I can do this without resorting to a loop? I.e. could I create a new dictionary that contains all the sources that take the 'complex' value for the 'morphology' keyword?
Hard to explain, but using logical indexing that I am used to from Matlab, it would look something like
complexSourceDataDict = sourceDataDict[*]['morphology'=='complex']
(where * indicates all objects in the dictionary)
Anyway - any help would be greatly appreciated!
Without a loop, no. With a list comprehension, yes:
complex = [src for src in sourceDataDict.itervalues() if src.get('morphology') == 'complex']
If sourceDataDict happens to really be a list, you can drop the itervalues (note: itervalues is Python 2; in Python 3 use values()):
complex = [src for src in sourceDataDict if src.get('morphology') == 'complex']
If you think about it, evaluating a * would imply a loop operation under the hood anyway (assuming it were valid syntax). So the trick is to do the most efficient looping you can with the data structure you are using.
The only way to get more efficient would be to index all of the data objects' "morphology" keys ahead of time and keep that index up to date.
There's not a direct way to index nested dictionaries out of order, like your desired syntax wants to do. However, there are a few ways to do it in Python, with varying interfaces and performance characteristics.
The best performing solution would probably be to create an additional dictionary which indexes by whatever characteristic you care about. For instance, to find entries whose 'morphology' value is 'complex', you'd do something like this:
from collections import defaultdict

# set up morphology dict (you could do this as part of generating the morphology)
morph_dict = defaultdict(list)
for data in sourceDataDict.values():
    morph_dict[data["morphology"]].append(data)

# later, you can access a list of the values with any particular morphology
complex_morph = morph_dict["complex"]
While this is high-performance, it may be annoying to need to set up the reverse indexes for everything ahead of time. An alternative might be to use a list comprehension or generator expression to iterate over your dictionary and find the appropriate values:
complex = (d for d in sourceDataDict.values() if d["morphology"] == "complex")
for c in complex:
    do_whatever(c)
I believe you are dealing with a structure similar to the following:
sourceDataDict = [
    {'spectrum': 1,
     'peakflux': 10,
     'morphology': 'simple'},
    {'spectrum': 2,
     'peakflux': 11,
     'morphology': 'complex'},
    {'spectrum': 3,
     'peakflux': 12,
     'morphology': 'simple'},
    {'spectrum': 4,
     'peakflux': 13,
     'morphology': 'complex'}
]
You can do something similar using a list comprehension:
>>> [e for e in sourceDataDict if e.get('morphology', None) == 'complex']
[{'morphology': 'complex', 'peakflux': 11, 'spectrum': 2}, {'morphology': 'complex', 'peakflux': 13, 'spectrum': 4}]
Using itertools.ifilter (Python 2; in Python 3, use the built-in filter), you can achieve a similar result:
>>> list(itertools.ifilter(lambda e: e.get('morphology', None) == 'complex', sourceDataDict))
[{'morphology': 'complex', 'peakflux': 11, 'spectrum': 2}, {'morphology': 'complex', 'peakflux': 13, 'spectrum': 4}]
Please note, get is used instead of indexing to ensure that the lookup won't fail even when the key 'morphology' does not exist. If the key is guaranteed to exist, you can rewrite the above as:
>>> [e for e in sourceDataDict if e['morphology'] == 'complex']
[{'morphology': 'complex', 'peakflux': 11, 'spectrum': 2}, {'morphology': 'complex', 'peakflux': 13, 'spectrum': 4}]
>>> list(itertools.ifilter(lambda e: e['morphology'] == 'complex', sourceDataDict))
[{'morphology': 'complex', 'peakflux': 11, 'spectrum': 2}, {'morphology': 'complex', 'peakflux': 13, 'spectrum': 4}]
When working with a huge amount of data, you may want to store it in some sort of database, perhaps with an ORM on top (though the latter is a matter of taste); an RDBMS may be the solution.
As for raw Python, there is no built-in solution except functional routines like filter, and either way you face iteration at some step (implicitly or not).
The simplest way is to keep an additional dict whose keys are attribute values:
objectsBy = {'morphology': {'complex': set(), 'simple': set()}}
for item in sources:
    ...
    objMorphology = compute_morphology(item)
    objectsBy['morphology'][objMorphology].add(item)
    ...
Related
The output of the code example below has terrible readability. The data I'd like to analyse is hidden within numerous frozenset({}) prints.
A = frozenset(["A1", "A2"])
B = frozenset(["B1", "B2"])
C = frozenset(["C1", "C2"])
foo = {A, B, C}
print(foo)
# {frozenset({'B1', 'B2'}), frozenset({'C1', 'C2'}), frozenset({'A1', 'A2'})}
print(*foo)
# frozenset({'B1', 'B2'}) frozenset({'C1', 'C2'}) frozenset({'A1', 'A2'})
Is there an easy way to omit the collection type from the printout? I'm only interested in the grouping of the entries.
A more readable output would be:
({'B1', 'B2'}, {'C1', 'C2'}, {'A1', 'A2'})
TL;DR: I stumbled upon this issue when trying to find the optimal way to group a large number of items in pairs. The grouping had to comply with a lot of boundary conditions, and I need a quick check that the solution found makes sense.
items = {'B1', 'B2', 'C1', 'C2', 'A1', 'A2'}
pairs = get_best_grouping(items) # Outputs a set containing frozensets
print(pairs)
Changing the str function of frozenset (or any other native type) is a related question, but that seems like a big intervention for this problem.
As you mentioned in the question, writing your own __str__ or __repr__ function would be the proper way to go. If that sounds like too much hassle, how about you modify the string after the fact?
print(str(frozenset({1,2,3})).replace('frozenset',''))
({1, 2, 3})
Assuming of course your set does not contain the word "frozenset".
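Another low-tech option (my own sketch, not from the answer above): convert each frozenset to a plain set just before printing, since plain sets print without a type name:

foo = {frozenset(["A1", "A2"]), frozenset(["B1", "B2"]), frozenset(["C1", "C2"])}

# Plain sets print as {...} with no class name, so convert for display only.
print(tuple(set(f) for f in foo))
# ({'B1', 'B2'}, {'A1', 'A2'}, {'C1', 'C2'})  -- order may vary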
But it is really not much more effort to make a new _frozenset class following the same logic as above (note the 1 in replace, to make sure that we only replace the first occurrence of the str '_frozenset'):
class _frozenset(frozenset):
    def __repr__(self):
        return (frozenset.__repr__(self)).replace('_frozenset', '', 1)

print(_frozenset([1, 2, '_frozenset']))
({1, 2, '_frozenset'})
IMHO, the key here is to simply reuse the definition of __repr__ of the builtin frozenset, so we don't have to worry too much about the logic behind how to represent an iterable. Against first intuition, frozenset.__repr__() seems to (I have not looked into it) inspect the name of the class of its instance and prepend that, so it is not 'frozenset' but '_frozenset' that one needs to replace.
Suppose I have some kind of dictionary structure like this (or another data structure representing the same thing):
d = {
    42.123231: 'X',
    42.1432423: 'Y',
    45.3213213: 'Z',
    # ...etc
}
I want to create a function like this:
def f(n, d, e):
    '''Return a list of the values in dictionary d corresponding to the float n
    within (+/-) the float error term e'''
So if I called the function like this with the above dictionary:
f(42,d,2)
It would return
['X','Y']
However, while it is straightforward to write this function with a loop, I don't want something that goes through every value in the dictionary and checks it exhaustively. I want it to take advantage of an indexed structure somehow (or even a sorted list) to make the search much faster.
A dictionary is the wrong data structure for this; write a search tree.
A Python dictionary is a hashmap implementation, so its keys can't be compared and traversed in order the way a search tree's can. You simply can't do this with a Python dictionary without actually checking all the keys.
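For illustration, the standard-library bisect module can stand in for a search tree here. A minimal sketch (my own, under the question's assumptions), sorting the keys once and then binary-searching the range:

import bisect

def f(n, d, e):
    '''Return the values of d whose float keys lie within n +/- e.'''
    keys = sorted(d)                       # build once, reuse for many queries
    lo = bisect.bisect_left(keys, n - e)   # first key >= n - e
    hi = bisect.bisect_right(keys, n + e)  # first key > n + e
    return [d[k] for k in keys[lo:hi]]

d = {42.123231: 'X', 42.1432423: 'Y', 45.3213213: 'Z'}
print(f(42, d, 2))  # ['X', 'Y']

After the initial sort, each query costs O(log n) to locate the bounds, plus the size of the output.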
Dictionaries with numeric keys may happen to iterate in key order, but that is an implementation accident, not a guarantee. To be on the safe side, rearrange the data as an OrderedDict once:
from collections import OrderedDict

d_ordered = OrderedDict(sorted(d.items(), key=lambda i: i[0]))
Then filtering values is rather simple, and it will stop at the upper border:
import itertools

# with lower, upper = n - e, n + e
values = [val for k, val in
          itertools.takewhile(lambda kv: kv[0] < upper, d_ordered.items())
          if k > lower]
As I've already noted, ordering the dictionary may look unnecessary in practice, but any apparent key ordering is based on the current implementation and may change in the future, so the explicit sort is the safe step.
EDIT: as @BrenBarn pointed out, the original didn't make sense.
Given a list of dicts (courtesy of csv.DictReader--they all have str keys and values) it'd be nice to remove duplicates by stuffing them all in a set, but this can't be done directly since dict isn't hashable. Some existing questions touch on how to fake __hash__() for sets/dicts but don't address which way should be preferred.
# i. concise but ugly round trip
filtered = [eval(x) for x in {repr(d) for d in pile_o_dicts}]

# ii. wordy but avoids round trip
filtered = []
keys = set()
for d in pile_o_dicts:
    key = str(d)
    if key not in keys:
        keys.add(key)
        filtered.append(d)

# iii. introducing another class for this seems Java-like?
filtered = {hashable_dict(x) for x in pile_o_dicts}

# iv. something else entirely
In the spirit of the Zen of Python what's the "obvious way to do it"?
Based on your example code, I take your question to be something slightly different from what you literally say. You don't actually want to override __hash__() -- you just want to filter out duplicates in linear time, right? So you need to ensure the following for each dictionary: 1) every key-value pair is represented, and 2) they are represented in a stable order. You could use a sorted tuple of key-value pairs, but instead, I would suggest using frozenset. frozensets are hashable, and they avoid the overhead of sorting, which should improve performance (as this answer seems to confirm). The downside is that they take up more memory than tuples, so there is a space/time tradeoff here.
Also, your code uses sets to do the filtering, but that doesn't make a lot of sense. There's no need for that ugly eval step if you use a dictionary:
filtered = {frozenset(d.iteritems()):d for d in pile_o_dicts}.values()
Or in Python 3, assuming you want a list rather than a dictionary view:
filtered = list({frozenset(d.items()):d for d in pile_o_dicts}.values())
These are both a bit clunky. For readability, consider breaking it into two lines:
dict_o_dicts = {frozenset(d.iteritems()):d for d in pile_o_dicts}
filtered = dict_o_dicts.values()
The alternative is an ordered tuple of tuples:
filtered = {tuple(sorted(d.iteritems())):d for d in pile_o_dicts}.values()
And a final note: don't use repr for this. Dictionaries that evaluate as equal can have different representations:
>>> d1 = {str(i):str(i) for i in range(300)}
>>> d2 = {str(i):str(i) for i in range(299, -1, -1)}
>>> d1 == d2
True
>>> repr(d1) == repr(d2)
False
The artfully named pile_o_dicts can be converted to a canonical form by sorting their items lists:
groups = {}
for d in pile_o_dicts:
    k = tuple(sorted(d.items()))
    groups.setdefault(k, []).append(d)
This will group identical dictionaries together.
FWIW, the technique of using sorted(d.items()) is currently used in the standard library for functools.lru_cache() in order to recognize function calls that have the same keyword arguments. IOW, this technique is tried and true :-)
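For the flavor of it, here is a sketch of that idea (my own illustration, not the actual stdlib code): a hashable key built from positional arguments plus keyword arguments in a canonical sorted order.

def make_key(args, kwds):
    # Keyword argument order must not matter, so sort the items.
    return args + tuple(sorted(kwds.items()))

assert make_key((1, 2), {'b': 2, 'a': 1}) == make_key((1, 2), {'a': 1, 'b': 2})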
If the dicts all have the same keys, you can use a namedtuple
>>> from collections import namedtuple
>>> nt = namedtuple('nt', pile_o_dicts[0])
>>> set(nt(**d) for d in pile_o_dicts)
I got a list of objects which look like strings, but are not real strings (think about mmap'ed files). Like this:
x = [ "abc", "defgh", "ij" ]
What I want is for x to be directly indexable as if it were one big string, i.e.:
(x[4] == "e") is True
(Of course I don't want to do "".join(x), which would merge all the strings, because reading a whole string is too expensive in my case. Remember, these are mmap'ed files.)
This is easy if you iterate over the entire list, but it seems to be O(n). So I've implemented __getitem__ more efficiently by creating such a list:
x = [ (0, "abc"), (3, "defgh"), (8, "ij") ]
Therefore I can do a binary search in __getitem__ to quickly find the tuple with the right data and then index into its string. This works quite well.
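(For concreteness, such a binary-search __getitem__ might look like the following sketch; the class and names are illustrative, using the standard bisect module.)

import bisect

class ChunkedString:
    '''Read-only sketch: index a list of string chunks as one big string.'''
    def __init__(self, chunks):
        self._offsets = []  # starting index of each chunk
        self._chunks = []
        pos = 0
        for c in chunks:
            self._offsets.append(pos)
            self._chunks.append(c)
            pos += len(c)
        self._len = pos

    def __len__(self):
        return self._len

    def __getitem__(self, i):
        if not 0 <= i < self._len:
            raise IndexError(i)
        # rightmost chunk whose start offset is <= i
        j = bisect.bisect_right(self._offsets, i) - 1
        return self._chunks[j][i - self._offsets[j]]

x = ChunkedString(["abc", "defgh", "ij"])
assert x[4] == "e"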
I see how to implement __setitem__, but it seems so boring, I'm wondering if there's not something that already does that.
To be more precise, this is how the data structure should honor __setitem__:
>>> x = [ "abc", "defgh", "ij" ]
>>> x[2:9] = "12345678"
>>> x
[ "ab", "12345678", "j" ]
I'd take any idea about such a data structure implementation, name or any hint.
What you are describing is a special case of the rope data structure.
Unfortunately, I am not aware of any Python implementations.
You have recreated the dictionary data type.
So do you still want to be able to address the n'th list element at all, like find that x.somemethod(2) == 'ij'?
If not, then your data structure is just a string with some methods to make it mutable and to initialize it from a list of strings.
If you do want to be able to, then your data structure is still a string with those extra methods, plus another element to track the ranges where its elements came from, like x.camefrom(1) == (3, 7).
Either way, it appears that you want to be storing and manipulating a string.
This could be a start:
self._h = {0:"abc", 3:"defgh", 8:"ij"} #create _h and __len__ in __init__
self.__len__ = 10
def __getitem__(i):
if i >= self.__len__:
raise IndexError
o=0
while True:
if i-o in self._h:
return self._h[i-o][o]
o+=1
improvements contain mutability.
I'm not aware of anything that does what you want.
However, if you've implemented __getitem__ efficiently the way you say, then you already have the code that maps an index to the right (offset, string) tuple. It seems you could just reuse that bit of code, with a little refactoring, to implement __setitem__, which needs the same information to perform its function.
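A sketch of that reuse (my own illustration; slice assignment only, step 1 assumed), written as a standalone function over a chunk list:

import bisect

def splice(chunks, start, stop, s):
    '''Replace positions [start, stop) of the logical string with s,
    keeping the chunked representation.'''
    offsets, pos = [], 0
    for c in chunks:                 # the same index __getitem__ relies on
        offsets.append(pos)
        pos += len(c)
    i = bisect.bisect_right(offsets, start) - 1
    j = bisect.bisect_right(offsets, max(stop - 1, start)) - 1
    head = chunks[i][:start - offsets[i]]
    tail = chunks[j][stop - offsets[j]:]
    # drop empty fragments so the chunk list stays clean
    return [c for c in chunks[:i] + [head, s, tail] + chunks[j + 1:] if c]

print(splice(["abc", "defgh", "ij"], 2, 9, "12345678"))
# ['ab', '12345678', 'j']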
I have a problem which requires a reversable 1:1 mapping of keys to values.
That means sometimes I want to find the value given a key, but at other times I want to find the key given the value. Both keys and values are guaranteed unique.
x = D[y]
y == D.inverse[x]
The obvious solution is to simply invert the dictionary every time I want a reverse lookup. Inverting a dictionary is very easy (there's a recipe here), but for a large dictionary it can be very slow.
The other alternative is to make a new class which unites two dictionaries, one for each kind of lookup. That would most likely be fast but would use up twice as much memory as a single dict.
So is there a better structure I can use?
My application requires that this should be very fast and use as little as possible memory.
The structure must be mutable, and it's strongly desirable that mutating the object should not cause it to be slower (e.g. to force a complete re-index)
We can guarantee that either the key or the value (or both) will be an integer
It's likely that the structure will be needed to store thousands or possibly millions of items.
Keys & Valus are guaranteed to be unique, i.e. len(set(x)) == len(x) for for x in [D.keys(), D.valuies()]
The other alternative is to make a new class which unites two dictionaries, one for each kind of lookup. That would most likely be fast but would use up twice as much memory as a single dict.
Not really. Have you measured that? Since both dictionaries would use references to the same objects as keys and values, the memory spent would be just the dictionary structure itself. That's a lot less than twice, and it's a fixed amount regardless of your data size.
What I mean is that the actual data wouldn't be copied. So you'd spend little extra memory.
Example:
a = "some really really big text spending a lot of memory"
number_to_text = {1: a}
text_to_number = {a: 1}
Only a single copy of the "really big" string exists, so you end up spending just a little more memory. That's generally affordable.
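If you want to check this yourself, a quick sketch (sys.getsizeof reports only the container's own footprint, not the referenced data):

import sys

a = "some really really big text spending a lot of memory" * 1000
number_to_text = {1: a}
text_to_number = {a: 1}

# Both dicts reference the very same string object; no copy is made.
assert number_to_text[1] is a
assert next(iter(text_to_number)) is a
print(sys.getsizeof(a))               # the one big string
print(sys.getsizeof(number_to_text))  # small dict overhead
print(sys.getsizeof(text_to_number))  # small dict overhead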
I can't imagine a solution where you'd have the key lookup speed when looking by value, if you don't spend at least enough memory to store a reverse lookup hash table (which is exactly what's being done in your "unite two dicts" solution).
class TwoWay:
    def __init__(self):
        self.d = {}

    def add(self, k, v):
        self.d[k] = v
        self.d[v] = k

    def remove(self, k):
        self.d.pop(self.d.pop(k))

    def get(self, k):
        return self.d[k]
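Usage might look like this (my sketch); note that because keys and values share one dict, a key must never collide with some other pair's value:

tw = TwoWay()
tw.add(1, 'one')
print(tw.get(1))      # 'one'
print(tw.get('one'))  # 1
tw.remove(1)          # removes both directions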
The other alternative is to make a new class which unites two dictionaries, one for each kind of lookup. That would most likely use up twice as much memory as a single dict.
Not really, since they would just be holding two references to the same data. In my mind, this is not a bad solution.
Have you considered an in-memory database lookup? I am not sure how it will compare in speed, but lookups in relational databases can be very fast.
Here is my own solution to this problem: http://github.com/spenthil/pymathmap/blob/master/pymathmap.py
The goal is to make it as transparent to the user as possible. The only significant attribute introduced is partner.
OneToOneDict subclasses dict - I know that isn't generally recommended, but I think I have the common use cases covered. The backend is pretty simple: it (dict1) keeps a weakref to a 'partner' OneToOneDict (dict2), which is its inverse. When dict1 is modified, dict2 is updated accordingly as well, and vice versa.
From the docstring:
>>> dict1 = OneToOneDict()
>>> dict2 = OneToOneDict()
>>> dict1.partner = dict2
>>> assert(dict1 is dict2.partner)
>>> assert(dict2 is dict1.partner)
>>> dict1['one'] = '1'
>>> dict2['2'] = '1'
>>> dict1['one'] = 'wow'
>>> assert(dict1 == dict((v,k) for k,v in dict2.items()))
>>> dict1['one'] = '1'
>>> assert(dict1 == dict((v,k) for k,v in dict2.items()))
>>> dict1.update({'three': '3', 'four': '4'})
>>> assert(dict1 == dict((v,k) for k,v in dict2.items()))
>>> dict3 = OneToOneDict({'4':'four'})
>>> assert(dict3.partner is None)
>>> assert(dict3 == {'4':'four'})
>>> dict1.partner = dict3
>>> assert(dict1.partner is not dict2)
>>> assert(dict2.partner is None)
>>> assert(dict1.partner is dict3)
>>> assert(dict3.partner is dict1)
>>> dict1.setdefault('five', '5')
>>> dict1['five']
'5'
>>> dict1.setdefault('five', '0')
>>> dict1['five']
'5'
When I get some free time, I intend to make a version that doesn't store things twice. No clue when that'll be though :)
Assuming that you have a key with which you look up a more complex mutable object, just make the key a property of that object. It does seem you might be better off thinking about the data model a bit.
"We can guarantee that either the key or the value (or both) will be an integer"
That's weirdly written -- "key or the value (or both)" doesn't feel right. Either they're all integers, or they're not all integers.
It sounds like they're all integers.
Or, it sounds like you're thinking of replacing the target object with an integer value so you only have one copy referenced by an integer. This is a false economy. Just keep the target object. All Python objects are -- in effect -- references. Very little actual copying gets done.
Let's pretend that you simply have two integers and can do a lookup on either one of the pair. One way to do this is to use heap queues or the bisect module to maintain ordered lists of integer key-value tuples.
See http://docs.python.org/library/heapq.html#module-heapq
See http://docs.python.org/library/bisect.html#module-bisect
You have one heapq of (key, value) tuples. Or, if your underlying object is more complex, (key, object) tuples.
You have another heapq of (value, key) tuples. Or, if your underlying object is more complex, (otherkey, object) tuples.
An "insert" becomes two inserts, one to each heapq-structured list.
A key lookup is in one queue; a value lookup is in the other queue. Do the lookups using bisect(list,item).
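A minimal sketch of the sorted-lists idea (class and method names are mine, not from the answer; bisect.insort keeps each list ordered, at O(n) insertion cost):

import bisect

class BisectBiMap:
    def __init__(self):
        self._by_key = []    # (key, value) tuples, kept sorted
        self._by_value = []  # (value, key) tuples, kept sorted

    def insert(self, key, value):
        bisect.insort(self._by_key, (key, value))
        bisect.insort(self._by_value, (value, key))

    def by_key(self, key):
        # (key,) sorts just before (key, anything), so bisect_left
        # lands on the matching pair if it exists.
        i = bisect.bisect_left(self._by_key, (key,))
        if i < len(self._by_key) and self._by_key[i][0] == key:
            return self._by_key[i][1]
        raise KeyError(key)

    def by_value(self, value):
        i = bisect.bisect_left(self._by_value, (value,))
        if i < len(self._by_value) and self._by_value[i][0] == value:
            return self._by_value[i][1]
        raise KeyError(value)

m = BisectBiMap()
m.insert(1, 'foo')
m.insert(5, 'bar')
assert m.by_key(5) == 'bar' and m.by_value('foo') == 1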
It so happens that I find myself asking this question all the time (yesterday in particular). I agree with the approach of making two dictionaries. Do some benchmarking to see how much memory it's taking. I've never needed to make it mutable, but here's how I abstract it, if it's of any use:
class BiDict(list):
    def __init__(self, *pairs):
        super(BiDict, self).__init__()
        self._first_access = {}
        self._second_access = {}
        for pair in pairs:
            self._first_access[pair[0]] = pair[1]
            self._second_access[pair[1]] = pair[0]
            self.append(pair)

    def _get_by_first(self, key):
        return self._first_access[key]

    def _get_by_second(self, key):
        return self._second_access[key]

    # You'll have to do some overrides to make it mutable.
    # Methods such as append, __add__, __del__, __iadd__,
    # to name a few, will have to maintain ._*_access.

class Constants(BiDict):
    # An implementation expecting an integer and a string
    get_by_name = BiDict._get_by_second
    get_by_number = BiDict._get_by_first

t = Constants(
    (1, 'foo'),
    (5, 'bar'),
    (8, 'baz'),
)
>>> print t.get_by_number(5)
bar
>>> print t.get_by_name('baz')
8
>>> print t
[(1, 'foo'), (5, 'bar'), (8, 'baz')]
How about using sqlite? Just create a :memory: database with a two-column table. You can even add indexes, then query by either one. Wrap it in a class if it's something you're going to use a lot.
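A quick sketch of that suggestion (table and column names are mine): with a PRIMARY KEY on one column and a UNIQUE constraint on the other, lookups in both directions are index-backed.

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE map (k INTEGER PRIMARY KEY, v TEXT UNIQUE)')
conn.executemany('INSERT INTO map VALUES (?, ?)', [(1, 'foo'), (5, 'bar')])

# Both directions use an index (PRIMARY KEY / UNIQUE implies an index).
print(conn.execute('SELECT v FROM map WHERE k = ?', (5,)).fetchone()[0])      # bar
print(conn.execute('SELECT k FROM map WHERE v = ?', ('foo',)).fetchone()[0])  # 1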