How to glob for iterable element - python

I have a Python dictionary that contains iterables, some of which are lists, but most of which are other dictionaries. I'd like to do glob-style assignment, similar to the following:
myiter['*']['*.txt']['name'] = 'Woot'
That is, for each element in myiter, look up all elements with keys ending in '.txt' and then set their 'name' item to 'Woot'.
I've thought about subclassing dict and using the fnmatch module, but it's unclear to me what the best way of accomplishing this is.

The best way, I think, would be not to do it -- '*' is a perfectly valid key in a dict, so myiter['*'] has a perfectly well defined meaning and usefulness, and subverting that can definitely cause problems. How to "glob" over keys which are not strings, including the exclusively integer "keys" (indices) in elements which are lists and not mappings, is also quite a design problem.
If you nevertheless must do it, I would recommend taking full control by subclassing the abstract base class collections.MutableMapping and implementing the needed methods (__len__, __iter__, __getitem__, __setitem__, __delitem__; for better performance, also override others such as __contains__, which the ABC implements in terms of the required ones, but slowly) on top of a contained dict. Subclassing dict instead, as other answers suggest, would require you to override a huge number of methods to avoid inconsistent behavior between the "keys containing wildcards" handling in the methods you do override and in those you don't.
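For concreteness, here is a minimal sketch of that approach, assuming a dict-backed class (the Globbable name is only a placeholder, and the wildcard semantics are deliberately left undecided):
from collections import MutableMapping  # collections.abc.MutableMapping on Python 3

class Globbable(MutableMapping):
    """A dict-backed mapping implementing only the required ABC methods."""
    def __init__(self, *args, **kwargs):
        self._data = dict(*args, **kwargs)
    def __len__(self):
        return len(self._data)
    def __iter__(self):
        return iter(self._data)
    def __getitem__(self, key):
        return self._data[key]
    def __setitem__(self, key, value):
        self._data[key] = value
    def __delitem__(self, key):
        del self._data[key]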
Whether you subclass collections.MutableMapping, or dict, to make your Globbable class, you have to make a core design decision: what does yourthing[somekey] return when yourthing is a Globbable?
Presumably it has to return a different type when somekey is a string containing wildcards, versus anything else. In the latter case, one would imagine, just what is actually at that entry; but in the former, it can't just return another Globbable -- otherwise, what would yourthing[somekey] = 'bah' do in the general case? For your single "slick syntax" example, you want it to set a somekey entry in each of the items of yourthing (a HUGE semantic break with the behavior of every other mapping in the universe;-) -- but then, how would you ever set an entry in yourthing itself?!
Let's see if the Zen of Python has anything to say about this "slick syntax" for which you yearn...:
>>> import this
...
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Consider for a moment the alternative of losing the "slick syntax" (and all the huge semantic headaches it necessarily implies) in favor of clarity and simplicity (using Python 2.7-and-better syntax here, just for the dict comprehension -- use an explicit dict(...) call instead if you're stuck with 2.6 or earlier), e.g.:
import fnmatch

def match(s, pat):
    try:
        return fnmatch.fnmatch(s, pat)
    except TypeError:
        return False

def sel(ds, pat):
    return [d[k] for d in ds for k in d if match(k, pat)]

def set(ds, k, v):  # note: shadows the built-in set within this scope
    for d in ds:
        d[k] = v
so your assignment might become
set(sel(sel([myiter], '*'), '*.txt'), 'name', 'Woot')
(the selection with '*' being redundant here, since '*' matches every string key). Is this so horrible as to be worth the morass of issues I've mentioned above in order to use instead
myiter['*']['*.txt']['name'] = 'Woot'
...? By far the clearest and best-performing way, of course, remains the even-simpler
def match(k, v, pat):
    try:
        if fnmatch.fnmatch(k, pat):
            return isinstance(v, dict)
    except TypeError:
        pass
    return False

for k, v in myiter.items():
    if match(k, v, '*'):
        for sk, sv in v.items():
            if match(sk, sv, '*.txt'):
                sv['name'] = 'Woot'
but if you absolutely crave conciseness and compactness, despising the Zen of Python's koan "Sparse is better than dense", you can at least obtain them without the various nightmares I mentioned as needed to achieve your ideal "syntax sugar".

The best way is to subclass dict and use the fnmatch module:
subclass dict: adds the functionality you want in an object-oriented way;
fnmatch module: reuses existing functionality.

You could use fnmatch for functionality to match on dictionary keys although you would have to compromise syntax slightly, especially if you wanted to do this on a nested dictionary. Perhaps a custom dictionary-like class with a search method to return wildcard matches would work well.
Here is a VERY BASIC example that comes with a warning that this is NOT RECURSIVE and will not handle nested dictionaries:
from fnmatch import fnmatch
class GlobDict(dict):
    def glob(self, match):
        """match should be a glob-style pattern (e.g. '*.txt')"""
        return dict((k, v) for k, v in self.items() if fnmatch(k, match))
# Start with a basic dict
basic_dict = {'file1.jpg': 'image', 'file2.txt': 'text',
              'file3.mpg': 'movie', 'file4.txt': 'text'}
# Create a GlobDict from it
glob_dict = GlobDict(**basic_dict)
# Then get glob-style results!
globbed_results = glob_dict.glob('*.txt')
# => {'file4.txt': 'text', 'file2.txt': 'text'}
As for what way is the best? The best way is the one that works. Don't try to optimize a solution before it's even created!
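Since the example above is explicitly not recursive, here is one hedged way to extend the idea to nested dictionaries, sketched as a plain helper function rather than a subclass (my own addition, not part of the original answer):
from fnmatch import fnmatch

def glob_recursive(d, pattern):
    """Hypothetical helper: collect key/value matches at every nesting level.
    Note: a key repeated at different levels overwrites earlier matches."""
    found = {}
    for k, v in d.items():
        if fnmatch(k, pattern):
            found[k] = v
        if isinstance(v, dict):
            found.update(glob_recursive(v, pattern))
    return found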

Following the principle of least magic, perhaps just define a recursive function, rather than subclassing dict:
import fnmatch
def set_dict_with_pat(it, key_patterns, value):
    if len(key_patterns) > 1:
        for key in it:
            if fnmatch.fnmatch(key, key_patterns[0]):
                set_dict_with_pat(it[key], key_patterns[1:], value)
    else:
        for key in it:
            if fnmatch.fnmatch(key, key_patterns[0]):
                it[key] = value
Which could be used like this:
myiter = {'dir1': {'a.txt': {'name': 'Roger'}, 'b.notxt': {'name': 'Carl'}},
          'dir2': {'b.txt': {'name': 'Sally'}}}
set_dict_with_pat(myiter, ['*', '*.txt', 'name'], 'Woot')
print(myiter)
# {'dir2': {'b.txt': {'name': 'Woot'}}, 'dir1': {'b.notxt': {'name': 'Carl'}, 'a.txt': {'name': 'Woot'}}}

Related

How to check if a variable is a dictionary in Python?

How would you check if a variable is a dictionary in Python?
For example, I'd like it to loop through the values in the dictionary until it finds a dictionary. Then, loop through the one it finds:
dict = {'abc': 'abc', 'def': {'ghi': 'ghi', 'jkl': 'jkl'}}
for k, v in dict.iteritems():
    if ###check if v is a dictionary:
        for k, v in v.iteritems():
            print(k, ' ', v)
    else:
        print(k, ' ', v)
You could use if type(ele) is dict, or use isinstance(ele, dict), which would also work if you had subclassed dict:
d = {'abc': 'abc', 'def': {'ghi': 'ghi', 'jkl': 'jkl'}}
for element in d.values():
    if isinstance(element, dict):
        for k, v in element.items():
            print(k, ' ', v)
How would you check if a variable is a dictionary in Python?
This is an excellent question, but it is unfortunate that the most upvoted answer leads with a poor recommendation, type(obj) is dict.
(Note that you should also not use dict as a variable name - it's the name of the builtin object.)
If you are writing code that will be imported and used by others, do not presume that they will use the dict builtin directly - making that presumption makes your code more inflexible and, in this case, creates easily hidden bugs that would not error the program out.
I strongly suggest, for the purposes of correctness, maintainability, and flexibility for future users, preferring the more flexible, idiomatic expressions over less flexible, unidiomatic ones.
is is a test for object identity. It does not support inheritance, it does not support any abstraction, and it does not support the interface.
So I will provide several options that do.
Supporting inheritance:
This is the first recommendation I would make, because it allows users to supply their own subclass of dict, or an OrderedDict, defaultdict, or Counter from the collections module:
if isinstance(any_object, dict):
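For instance, all of the common dict variants pass this check (a small illustration of my own):
from collections import OrderedDict, defaultdict, Counter

for obj in (dict(), OrderedDict(), defaultdict(list), Counter()):
    print(isinstance(obj, dict))  # True for every one of them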
But there are even more flexible options.
Supporting abstractions:
from collections.abc import Mapping
if isinstance(any_object, Mapping):
This allows the user of your code to use their own custom implementation of an abstract Mapping, which also includes any subclass of dict, and still get the correct behavior.
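A brief illustration with a hypothetical LowerDict class that is not a dict at all:
from collections.abc import Mapping

class LowerDict(Mapping):
    """A read-only, case-insensitive mapping."""
    def __init__(self, data):
        self._data = {k.lower(): v for k, v in data.items()}
    def __getitem__(self, key):
        return self._data[key.lower()]
    def __iter__(self):
        return iter(self._data)
    def __len__(self):
        return len(self._data)

ld = LowerDict({'Content-Type': 'text/html'})
print(isinstance(ld, Mapping))  # True: it satisfies the abstraction
print(isinstance(ld, dict))     # False: it is not a dict subclass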
Use the interface
You commonly hear the OOP advice, "program to an interface".
This strategy takes advantage of Python's polymorphism or duck-typing.
So just attempt to use the interface, catching the specific expected errors (AttributeError in case there is no .items, TypeError in case items is not callable) with a reasonable fallback. Any class that implements that interface will then give you its items (note that .iteritems() is gone in Python 3):
try:
    items = any_object.items()
except (AttributeError, TypeError):
    non_items_behavior(any_object)
else:  # no exception raised
    for item in items:
        ...
Perhaps you might think using duck-typing like this goes too far in allowing for too many false positives, and it may be, depending on your objectives for this code.
Conclusion
Don't use is to check types for standard control flow. Use isinstance, consider abstractions like Mapping or MutableMapping, and consider avoiding type-checking altogether, using the interface directly.
In Python 3.6:
typeVariable = type(variable)
print('comparison', typeVariable == dict)
if typeVariable == dict:
    print('true')
else:
    print('false')
The OP did not exclude the starting variable, so for completeness here is how to handle the generic case of processing a supposed dictionary whose items may themselves be dictionaries.
This also follows the pure-Python (3.8) recommended way of testing for a dictionary, per the above comments.
from collections.abc import Mapping

my_dict = {'abc': 'abc', 'def': {'ghi': 'ghi', 'jkl': 'jkl'}}

def parse_dict(in_dict):
    if isinstance(in_dict, Mapping):
        for k_outer, v_outer in in_dict.items():
            if isinstance(v_outer, Mapping):
                for k_inner, v_inner in v_outer.items():
                    print(k_inner, v_inner)
            else:
                print(k_outer, v_outer)

parse_dict(my_dict)
My testing has found this to work now that we have type hints:
from typing import Dict

if isinstance(my_dict, Dict):
    print('True')
else:
    print('False')
Side note: there is some discussion about typing.Dict here.

What is the purpose of collections.ChainMap?

In Python 3.3 a ChainMap class was added to the collections module:
A ChainMap class is provided for quickly linking a number of mappings
so they can be treated as a single unit. It is often much faster than
creating a new dictionary and running multiple update() calls.
Example:
>>> from collections import ChainMap
>>> x = {'a': 1, 'b': 2}
>>> y = {'b': 10, 'c': 11}
>>> z = ChainMap(y, x)
>>> for k, v in z.items():
...     print(k, v)
...
a 1
c 11
b 10
It was motivated by this issue and made public by this one (no PEP was created).
As far as I understand, it is an alternative to having an extra dictionary and maintaining it with update()s.
The questions are:
What use cases does ChainMap cover?
Are there any real world examples of ChainMap?
Is it used in third-party libraries that switched to python3?
Bonus question: is there a way to use it on Python2.x?
I've heard about it in Transforming Code into Beautiful, Idiomatic Python PyCon talk by Raymond Hettinger and I'd like to add it to my toolkit, but I lack in understanding when should I use it.
I like @b4hand's examples, and indeed I have in the past used ChainMap-like structures (though not ChainMap itself) for the two purposes he mentions: multi-layered configuration overrides, and variable stack/scope emulation.
I'd like to point out two other motivations/advantages/differences of ChainMap, compared to a dict-update loop that stores only the "final" version:
More information: since a ChainMap structure is "layered", it supports answering questions like: am I getting the "default" value, or an overridden one? What is the original ("default") value? At what level did the value get overridden (borrowing @b4hand's config example: user-config or command-line-overrides)? Using a simple dict, the information needed for answering these questions is already lost.
Speed tradeoff: suppose you have N layers and at most M keys in each, constructing a ChainMap takes O(N) and each lookup O(N) worst-case[*], while construction of a dict using an update-loop takes O(NM) and each lookup O(1). This means that if you construct often and only perform a few lookups each time, or if M is big, ChainMap's lazy-construction approach works in your favor.
[*] The analysis in (2) assumes dict-access is O(1), when in fact it is O(1) on average, and O(M) worst case. See more details here.
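To make both points concrete, here is a small hedged example (the layer names are made up):
from collections import ChainMap

defaults = {'theme': 'light', 'retries': 3}
user_config = {'theme': 'dark'}
cli_overrides = {}

config = ChainMap(cli_overrides, user_config, defaults)
print(config['theme'])    # 'dark': overridden at the user_config layer
print(config['retries'])  # 3: still the default

# The layers stay accessible via .maps, so you can ask where a key was set:
for i, layer in enumerate(config.maps):
    if 'theme' in layer:
        print('theme comes from layer', i)  # layer 1 (user_config)
        break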
I could see using ChainMap for a configuration object where you have multiple scopes of configuration like command line options, a user configuration file, and a system configuration file. Since lookups are ordered by the order in the constructor argument, you can override settings at lower scopes. I've not personally used or seen ChainMap used, but that's not surprising since it is a fairly recent addition to the standard library.
It might also be useful for emulating stack frames where you push and pop variable bindings if you were trying to implement a lexical scope yourself.
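For the scope-emulation idea, ChainMap's new_child() and parents give you the push and pop; a brief sketch (variable names are my own):
from collections import ChainMap

scope = ChainMap({'x': 1})         # outermost "frame"
scope = scope.new_child({'x': 2})  # push a frame that shadows x
print(scope['x'])                  # 2: the innermost binding wins
print(scope.parents['x'])          # 1: look past the innermost frame
scope = scope.parents              # pop the frame
print(scope['x'])                  # 1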
The standard library docs for ChainMap give several examples and links to similar implementations in third-party libraries. Specifically, it names Django’s Context class and Enthought's MultiContext class.
I'll take a crack at this:
ChainMap looks like a very just-so kind of abstraction. It's a good solution for a very specialized kind of problem. I propose this use case.
If you have:
1. multiple mappings (e.g., dicts);
2. some duplication of keys in those mappings (the same key can appear in multiple mappings, but not every key appears in every mapping);
3. a consuming application which wishes to access the value of a key in the "highest priority" mapping, where there is a total ordering over all the mappings for any given key (mappings may have equal priority, but only if it is known that there is no duplication of keys within those mappings; in Python, for instance, packages can live in the same directory (same priority) but must have different names, so by definition the symbol names in that directory cannot be duplicates);
4. a consuming application which does not need to change the value of a key;
5. mappings which must at the same time maintain their independent identity and can be changed asynchronously by an external force;
6. and mappings big enough, expensive enough to access, or changing often enough between application accesses that the cost of computing the projection (3) each time your app needs it is a significant performance concern for your application...
Then you might consider using a ChainMap to create a view over the collection of mappings.
But this is all after-the-fact justification. The Python guys had a problem, came up with a good solution in the context of their code, then did some extra work to abstract their solution so we could use it if we choose. More power to them. But whether it's appropriate for your problem is up to you to decide.
To imperfectly answer your:
Bonus question: is there a way to use it on Python2.x?
from ConfigParser import _Chainmap as ChainMap
However, keep in mind that this isn't a real ChainMap: it inherits from DictMixin and only defines:
__init__(self, *maps)
__getitem__(self, key)
keys(self)
# And from DictMixin:
__iter__(self)
has_key(self, key)
__contains__(self, key)
iteritems(self)
iterkeys(self)
itervalues(self)
values(self)
items(self)
clear(self)
setdefault(self, key, default=None)
pop(self, key, *args)
popitem(self)
update(self, other=None, **kwargs)
get(self, key, default=None)
__repr__(self)
__cmp__(self, other)
__len__(self)
Its implementation also doesn't seem particularly efficient.

Given an arbitrary collection, is there a way to tell if it is ordered?

Here's what I have so far:
def is_ordered(collection):
    if isinstance(collection, set):
        return False
    if isinstance(collection, list):
        return True
    if isinstance(collection, dict):
        return False
    raise Exception("unknown collection")
Is there a much better way to do this?
NB: I do mean ordered and not sorted.
Motivation:
I want to iterate over an ordered collection. e.g.
def most_important(priorities):
    for p in priorities:
        print p
In this case the fact that priorities is ordered is important; what kind of collection it is is not. I'm trying to live by duck typing here, and I have frequently been dissuaded from type checking by Pythonistas.
If the collection is truly arbitrary (meaning it can be of any class whatsoever), then the answer has to be no.
Basically, there are two possible approaches:
know about every possible class that can be presented to your method, and whether it's ordered;
test the collection yourself by inserting into it every possible combination of keys, and seeing whether the ordering is preserved.
The latter is clearly infeasible. The former is along the lines of what you already have, except that you have to know about every derived class such as collections.OrderedDict; checking for dict is not enough.
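A quick illustration of that point, reusing the is_ordered from the question:
from collections import OrderedDict

print(isinstance(OrderedDict(), dict))  # True, so the dict branch catches it
print(is_ordered(OrderedDict()))        # False, even though it is ordered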
Frankly, I think the whole is_ordered check is a can of worms. Why do you want to do this anyway?
Update: In essence, you are trying to unittest the argument passed to you. Stop doing that, and unittest your own code. Test your consumer (make sure it works with ordered collections), and unittest the code that calls it, to ensure it is getting the right results.
In a statically-typed language you would simply restrict yourself to specific types. If you really want to replicate that, simply specify the only types you accept, test for those, and raise an exception if anything else is passed. It's not Pythonic, but it reliably achieves what you want.
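A hedged sketch of that strict approach, reusing the question's most_important (the set of accepted types is my assumption):
def most_important(priorities):
    # Accept only types known to preserve insertion order; reject the rest.
    if not isinstance(priorities, (list, tuple)):
        raise TypeError("ordered collection required, got %s"
                        % type(priorities).__name__)
    for p in priorities:
        print(p)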
Well, you have two possible approaches:
Anything with an append method is almost certainly ordered; and
If it only has an add method, you can try adding a nonce value and then iterating over the collection to see whether the nonce appears at the end (or perhaps at one end); you could add a second nonce and repeat, just to be more confident (see the sketch below).
Of course, this won't work where e.g. the collection is empty, or there is an ordering function that doesn't result in addition at the ends.
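For what it's worth, the probe might look like this (illustration only; it mutates the collection, assumes a discard method, and inherits all the caveats just mentioned):
def appears_ordered(collection):
    nonce = object()
    collection.add(nonce)
    try:
        # If insertion order is preserved, the nonce should come out last.
        return list(collection)[-1] is nonce
    finally:
        collection.discard(nonce)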
Probably a better solution is simply to specify that your code requires ordered collections, and only pass it ordered collections.
I think that enumerating the 90% case is about as good as you're going to get (if using Python 3, replace basestring with str). Probably also want to consider how you would handle generator expressions and similar ilk, too (again, if using Py3, skip the xrangor):
import collections

generator = type(i for i in xrange(0))
enumerator = type(enumerate(range(0)))
xrangor = type(xrange(0))

is_ordered = lambda seq: isinstance(seq, (tuple, list, collections.OrderedDict,
                                          basestring, generator, enumerator, xrangor))
If your callers start using itertools, then you'll also need to add itertools types as returned by islice, imap, groupby. But the sheer number of these special cases really starts to point to a code smell.
What if the list is not ordered, e.g. [1,3,2]?

Dynamic typing design : is recursivity for dealing with lists a good design?

Lacking experience with maintaining dynamically typed code, I'm looking for the best way to handle this kind of situation:
(Example in python, but could work with any dynamic-typed language)
def some_function(object_that_could_be_a_list):
    if isinstance(object_that_could_be_a_list, list):
        for element in object_that_could_be_a_list:
            some_function(element)
    else:
        # Do stuff that expects the object to have certain properties
        # a list would not have
        ...
I'm quite uneasy with this, since I think a method should do only one thing, and I'm thinking that it is not as readable as it should be. So, I'd be tempted to make three functions : the first that'll take any object and "sort" between the two others, one for the lists, another for the "simple" objects. Then again, that'd add some complexity.
What is the most "sustainable" solution here, the one that guarantees ease of maintenance? Is there an idiom in Python for these situations that I'm unaware of? Thanks in advance.
Don't type check - do what you want to do, and if it won't work, it'll throw an exception which you can catch and manage.
The Python mantra is 'ask for forgiveness, not permission'. Type checking takes extra time, and most of the time it's pointless. It also doesn't make much sense in a duck-typed environment: if it works, who cares what type it is? Why limit yourself to lists when other iterables will work too?
E.g.:
def some_function(object_that_could_be_a_list):
    try:
        for element in object_that_could_be_a_list:
            some_function(element)
    except TypeError:
        ...
This is more readable, will work in more cases (any other iterable that isn't a list will work too, and there are a lot of them), and will often be faster.
Note you are getting terminology mixed up. Python is dynamically typed, but not weakly typed. Weak typing means objects change type as needed. For example, if you add a string and an int, it will convert the string to an int to do the addition. Python does not do this. Dynamic typing means you don't declare a type for a variable, and it may contain a string at some point, then an int later.
Duck typing is a term used to describe the use of an object without caring about its type. If it walks like a duck, and quacks like a duck - it's probably a duck.
Now, this is a general thing, and if you think your code will get the 'wrong' type of object more often than the 'right', then you might want to type check for speed. Note that this is rare, and it's always best to avoid premature optimisation. Do it by catching exceptions, and then test - if you find it's a bottleneck, then optimise.
A common practice is to implement the multiple interface by way of using different parameters for different kinds of input.
def foo(thing=None, thing_seq=None):
    if thing_seq is not None:
        for _thing in thing_seq:
            foo(thing=_thing)
    if thing is not None:
        print "did foo with", thing
Rather than doing it recursively, I tend to do it this way:
def foo(x):
    if not isinstance(x, list):
        x = [x]
    for y in x:
        do_something(y)
You can use decorators in this case to make it more maintainable:
from mm import multimethod

@multimethod(int, int)
def foo(a, b):
    ...code for two ints...

@multimethod(float, float)
def foo(a, b):
    ...code for two floats...

@multimethod(str, str)
def foo(a, b):
    ...code for two strings...

How Awful is My Decorator?

I recently created a @sequenceable decorator that can be applied to any function taking one argument, and causes it to automatically be applicable to any sequence. This is the code (Python 2.5):
def sequenceable(func):
    def newfunc(arg):
        if hasattr(arg, '__iter__'):
            if isinstance(arg, dict):
                return dict((k, func(v)) for k, v in arg.iteritems())
            else:
                return map(func, arg)
        else:
            return func(arg)
    return newfunc
In use:
from datetime import datetime

@sequenceable
def unixtime(dt):
    return int(dt.strftime('%s'))
>>> unixtime(datetime.now())
1291318284
>>> unixtime([datetime.now(), datetime(2010, 12, 3)])
[1291318291, 1291352400]
>>> unixtime({'start': datetime.now(), 'end': datetime(2010, 12, 3)})
{'start': 1291318301, 'end': 1291352400}
My questions are:
Is this a terrible idea, and why?
Is this a possibly good idea, but has significant drawbacks as implemented?
What are the potential pitfalls of using this code?
Is there a builtin or library that already does what I'm doing?
This is a terrible idea. This is essentially loose typing; duck typing is as far as this stuff should be taken, IMO.
Consider this:
def pluralize(f):
    def on_seq(seq):
        return [f(x) for x in seq]
    def on_dict(d):
        return dict((k, f(v)) for k, v in d.iteritems())
    f.on_dict = on_dict
    f.on_seq = on_seq
    return f
Your example then becomes
@pluralize
def unixtime(dt):
    return int(dt.strftime('%s'))
unixtime.on_seq([datetime.now(), datetime(2010, 12, 3)])
unixtime.on_dict({'start': datetime.now(), 'end': datetime(2010, 12, 3)})
Doing it this way still requires the caller to know (to within duck-typing accuracy) what is being passed in and doesn't add any typechecking overhead to the actual function. It will also work with any dict-like object whereas your original solution depends on it being an actual subclass of dict.
In my opinion, you seem to be building logic into the wrong place. Why should unixtime know anything about sequencing? In some cases it would be a good idea (for performance or even semantics) but here it seems like you're adding extra features to unixtime that don't make sense in that context.
Better is just to use (say) a list comprehension:
[unixtime(x) for x in [datetime.now(), datetime(2010, 12, 3)]]
that way, you're using the proper Pythonic construct for applying the same thing to a sequence, and you're not polluting unixtime with ideas about sequences. You end up coupling logic (about sequencing) in places where the implementation should be free of that knowledge.
EDIT:
It basically comes down to coding style, reusability and maintainability. You want well partitioned code, so that when you're coding unixtime (say), you're concerned exclusively with converting to a unixtime. Then, if you're interested in sequences, you design functionality (or use built-in syntax) that is exclusively concerned with sequences. That makes it easier to think clearly about each operation, test, debug and reuse code. Think about it even in terms of the name: the original function is appropriately called unixtime, but your sequenced version might more appropriately be called unixtime_sequence, which is weird and suggests an unusual function.
Sometimes, of course, you break that rule. If (but only when) performance is an issue, you might combine functionality. But in general, partitioning things at first into clear parts leads to clear thinking, clear coding and easy reuse.
I am not a huge fan of trying to help out callers too much. Python is expressive enough that it's not a big deal for the caller to handle the "listification". It's easy enough for a caller to write out the dict comprehension or map call.
As a Python programmer that's what I would expect to have to do since the Python standard library functions don't help me out this way. This idiom actually makes me a little more confused because now I have to try to remember what methods are "helpful" and which aren't.
Being too flexible is a minor gripe I have with the Python-based build tool SCons. Its methods are all very accommodating. If you want to set some preprocessor defines you can give it a string, a list of strings, tuples, a dict, etc. It's very convenient but a bit overwhelming.
env = Environment(CPPDEFINES='xyz') # -Dxyz
env = Environment(CPPDEFINES=[('B', 2), 'A']) # -DB=2 -DA
env = Environment(CPPDEFINES={'B':2, 'A':None}) # -DA -DB=2
Not Pythonic, because:
- in Python, explicit is considered better than implicit;
- it is not the standard idiom, like using the builtin map or a list comprehension.
Consider what happens when the decorated function's single argument is itself a sequence:
from math import sqrt

@sequenceable
def distance(points):
    return sqrt(reduce(
        lambda a, b: a + b,
        (a**2 for a in points),
        0))
And your decorator becomes useless: the wrapped distance can never receive its sequence argument intact. Your decorator can be applied only in special cases, and by using it you'll break one of the Zen of Python's rules: "There should be one-- and preferably only one --obvious way to do it".
