Most Pythonic way for creating a defaultdictionary counter - python

I am trying to count occurrences of various items based on a condition. What I have so far is a function that, given pairs of items, increments a counter like this:
given [('a', 'a'), ('a', 'b'), ('b', 'a')] it will output defaultdict(<class 'collections.Counter'>, {'a': Counter({'a': 1, 'b': 1}), 'b': Counter({'a': 1})})
The function can be seen below:
from collections import Counter, defaultdict
def freq(samples=None):
    # outer key -> Counter of inner keys
    out = defaultdict(Counter)
    if samples:
        for (c, s) in samples:
            out[c][s] += 1
    return out
It is limited, though, to 2-tuples, while I would like it to be more generic and work with tuples of any length, e.g. [('a', 'a', 'b'), ('a', 'b', 'c'), ('b', 'a', 'a')] would still work, and I would be able to query the result with, say, res['a']['b'] and get the count for 'c', which is one.
What would be the best way to do this in Python?

Assuming all tuples in the list have the same length:
from collections import Counter
from itertools import groupby
from operator import itemgetter
def freq(samples=[]):
    sorted_samples = sorted(samples)
    if sorted_samples and len(sorted_samples[0]) > 2:
        return {key: freq(value[1:] for value in values) for key, values in groupby(sorted_samples, itemgetter(0))}
    else:
        return {key: Counter(value[1] for value in values) for key, values in groupby(sorted_samples, itemgetter(0))}
That gives:
>>> freq([('a', 'a'), ('a', 'b'), ('b', 'a'), ('a', 'c')])
{'a': Counter({'a': 1, 'b': 1, 'c': 1}), 'b': Counter({'a': 1})}
>>> freq([('a', 'a', 'a'), ('a', 'b', 'c'), ('b', 'a', 'a'), ('a', 'c', 'c')])
{'a': {'a': Counter({'a': 1}), 'b': Counter({'c': 1}), 'c': Counter({'c': 1})}, 'b': {'a': Counter({'a': 1})}}

One option is to use the full tuples as keys:
from collections import Counter
def freq(samples=[]):
    out = Counter()
    for sample in samples:
        out[sample] += 1
    return out
which would then return something like:
Counter({('a', 'a', 'b'): 1, ('a', 'b', 'c'): 1, ('b', 'a', 'a'): 1})
You could convert the tuples to strings to select certain slices, e.g. "('a', 'b',". For example in a new dictionary {k: v for k,v in out.items() if str(k)[:10] == "('a', 'b',"}.
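As an alternative to matching on the string form, the same selection can be made by comparing a slice of each tuple key against a prefix tuple. This is only a sketch; the helper name with_prefix and the extra sample ('a', 'b', 'd') are made up for illustration.

```python
from collections import Counter

# Sample data; ('a', 'b', 'd') is added here only to make the filter visible.
samples = [('a', 'a', 'b'), ('a', 'b', 'c'), ('b', 'a', 'a'), ('a', 'b', 'd')]
out = Counter(samples)

def with_prefix(counts, prefix):
    """Return the entries of `counts` whose tuple key starts with `prefix`."""
    prefix = tuple(prefix)
    return {k: v for k, v in counts.items() if k[:len(prefix)] == prefix}

selected = with_prefix(out, ('a', 'b'))
print(selected)
```

Comparing tuples directly avoids depending on how repr() formats quotes and spaces.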
If the groups are indeed either 2 or 3 long, but never both, you can change to:
from collections import defaultdict
def freq(samples):
    l = len(samples[0])
    if l == 2:
        # one defaultdict level per key position, with int counts at the leaves
        out = defaultdict(lambda: defaultdict(int))
        for a, b in samples:
            out[a][b] += 1
    elif l == 3:
        out = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
        for a, b, c in samples:
            out[a][b][c] += 1
    return out
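The fixed-length branches can be generalised to tuples of any length by building the nesting recursively. The following is a minimal sketch, assuming all tuples have the same length (taken from the first sample); the helper name nested_counter is hypothetical.

```python
from collections import Counter, defaultdict

def nested_counter(depth):
    """Build `depth` key levels: defaultdicts on the outside, a Counter innermost."""
    if depth <= 1:
        return Counter()
    return defaultdict(lambda: nested_counter(depth - 1))

def freq(samples):
    samples = list(samples)
    if not samples:
        return Counter()
    # Assumption: every tuple has the same length as the first one.
    out = nested_counter(len(samples[0]))
    for sample in samples:
        node = out
        for key in sample[:-1]:   # walk down one level per leading element
            node = node[key]
        node[sample[-1]] += 1     # the last element is counted in the Counter
    return out

res = freq([('a', 'a', 'b'), ('a', 'b', 'c'), ('b', 'a', 'a')])
print(res['a']['b']['c'])
```

Note that looking up a missing key on the defaultdict levels creates an empty entry as a side effect; use .get() when that matters.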

Related

Pandas dataframe to dict of list of tuples

Suppose I have the following dataframe:
df = pd.DataFrame({'id': [1,2,3,3,3], 'v1': ['a', 'a', 'c', 'c', 'd'], 'v2': ['z', 'y', 'w', 'y', 'z']})
df
id v1 v2
1 a z
2 a y
3 c w
3 c y
3 d z
And I want to transform it to this format:
{1: [('a', 'z')], 2: [('a', 'y')], 3: [('c', 'w'), ('c', 'y'), ('d', 'z')]}
I basically want to create a dict where the keys are the ids and each value is a list of the (v1, v2) tuples for that id.
I tried using groupby on id:
df.groupby('id')[['v1', 'v2']].apply(list)
But this didn't work.
Create tuples first and then pass to groupby with aggregate list:
d = df[['v1', 'v2']].agg(tuple, 1).groupby(df['id']).apply(list).to_dict()
print (d)
{1: [('a', 'z')], 2: [('a', 'y')], 3: [('c', 'w'), ('c', 'y'), ('d', 'z')]}
Another idea is using MultiIndex:
d = df.set_index(['v1', 'v2']).groupby('id').apply(lambda x: x.index.tolist()).to_dict()
You can use defaultdict from the collections library:
from collections import defaultdict
d = defaultdict(list)
for k, v, s in df.to_numpy():
    d[k].append((v, s))
defaultdict(list,
            {1: [('a', 'z')],
             2: [('a', 'y')],
             3: [('c', 'w'), ('c', 'y'), ('d', 'z')]})
df['New'] = [tuple(x) for x in df[['v1','v2']].to_records(index=False)]
df=df[['id','New']]
df=df.set_index('id')
df.to_dict()
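Output (note that to_dict() keeps only the last row for the repeated id 3, so this does not produce the requested lists):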
{'New': {1: ('a', 'z'), 2: ('a', 'y'), 3: ('d', 'z')}}

Calculating total and relative frequency of values in a dict representing a Markov-chain rule

I have made a function make_rule(text, scope=1) that simply goes over a string and generates a dictionary that serves as a rule for a Markovian text-generator (where the scope is the number of linked characters, not words).
>>> rule = make_rule("abbcad", 1)
>>> rule
{'a': ['b', 'd'], 'b': ['b', 'c'], 'c': ['a']}
I have been tasked with calculating the entropy of this system. In order to do that I think I would need to know:
How often a value appears in the dictionary in total, i.e. its total frequency.
How often a value appears given a key in the dictionary, i.e. its relative frequency.
Is there a quick way to get both of these numbers for each of the values in the dictionary?
For the above example I would need this output:
'a' total: 1, 'a'|'a': 0, 'a'|'b': 0, 'a'|'c': 1
'b' total: 2, 'b'|'a': 1, 'b'|'b': 1, 'b'|'c': 0
'c' total: 1, 'c'|'a': 0, 'c'|'b': 1, 'c'|'c': 0
'd' total: 1, 'd'|'a': 1, 'd'|'b': 0, 'd'|'c': 0
I guess the 'a' total is easily inferred, so maybe instead just output a list of triples for every unique item that appears in the dictionary:
[[('a', 'a', 0), ('a', 'b', 0), ('a', 'c', 1)], [('b', 'a', 1), ('b', 'b', 1), ('b', 'c', 0)], ...]
I'll just deal with "How often a value appears given a key in the dictionary", since you've said that "How often a value appears in the dictionary in total" is easily inferred.
If you just want to be able to look up the relative frequency of a value for a given key, it's easy to get that with a dict of Counter objects:
from collections import Counter
rule = {'a': ['b', 'd'], 'b': ['b', 'c'], 'c': ['a']}
freq = {k: Counter(v) for k, v in rule.items()}
… which gives you a freq like this:
{
'a': Counter({'b': 1, 'd': 1}),
'b': Counter({'b': 1, 'c': 1}),
'c': Counter({'a': 1})
}
… so that you can get the relative frequency of 'a' given the key 'c' like this:
>>> freq['c']['a']
1
Because Counter objects return 0 for nonexistent keys, you'll also get zero frequencies as you would expect:
>>> freq['a']['c']
0
If you need a list of 3-tuples as specified in your question, you can get that with a little extra work. Here's a function to do it:
def triples(rule):
    freq = {k: Counter(v) for k, v in rule.items()}
    all_values = sorted(set().union(*rule.values()))
    sorted_keys = sorted(rule)
    return [(v, k, freq[k][v]) for v in all_values for k in sorted_keys]
The only thing here which I think may not be self-explanatory is the all_values = ... line, which:
creates an empty set()
produces the union() of that set with all the individual elements of the lists in rule.values() (note the use of the argument-unpacking * operator)
converts the result into a sorted() list.
If you still have the original text, you can avoid all that work by using e.g. all_values = sorted(set(original_text)) instead.
Here it is in action:
>>> triples({'a': ['b', 'd'], 'b': ['b', 'c'], 'c': ['a']})
[
('a', 'a', 0), ('a', 'b', 0), ('a', 'c', 1),
('b', 'a', 1), ('b', 'b', 1), ('b', 'c', 0),
('c', 'a', 0), ('c', 'b', 1), ('c', 'c', 0),
('d', 'a', 1), ('d', 'b', 0), ('d', 'c', 0)
]
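Since the stated goal is an entropy calculation, the per-key counts can be turned into conditional probabilities and then into a Shannon entropy for each key. This is a sketch under assumptions: it assumes Python 3 division and that "entropy" here means the per-state entropy in bits; the function names are hypothetical, and weighting the states by how often they occur is left out.

```python
from collections import Counter
from math import log2

rule = {'a': ['b', 'd'], 'b': ['b', 'c'], 'c': ['a']}

def conditional_probabilities(rule):
    """Map each key to {value: P(value | key)} based on its follower list."""
    probs = {}
    for key, followers in rule.items():
        counts = Counter(followers)
        total = sum(counts.values())
        probs[key] = {value: n / total for value, n in counts.items()}
    return probs

def state_entropy(dist):
    """Shannon entropy in bits of one probability distribution."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

probs = conditional_probabilities(rule)
entropies = {key: state_entropy(dist) for key, dist in probs.items()}
print(probs)
print(entropies)
```

Keys with two equally likely followers come out at one bit, and a key with a single possible follower at zero.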
I cannot think of a quick way other than iterating over the word's characters, counting the occurrences in each list of the dictionary and summing them at the end:
alphabet = sorted(set("abbcad"))
rule = {'a': ['b', 'd'], 'b': ['b', 'c'], 'c': ['a']}
totalMatrix = []
for elem in alphabet:
    total = 0
    occurrences = []
    for key in rule.keys():
        # how often elem follows key
        currentCount = rule[key].count(elem)
        total += currentCount
        occurrences.append((elem, key, currentCount))
    totalMatrix.append([elem, total] + occurrences)
for elem in totalMatrix:
    print(elem)
The content of totalMatrix will be:
['a', 1, ('a', 'a', 0), ('a', 'b', 0), ('a', 'c', 1)]
['b', 2, ('b', 'a', 1), ('b', 'b', 1), ('b', 'c', 0)]
['c', 1, ('c', 'a', 0), ('c', 'b', 1), ('c', 'c', 0)]
['d', 1, ('d', 'a', 1), ('d', 'b', 0), ('d', 'c', 0)]

Create a new dictionary using an existing one and list

If I had:
adict = {'a':3, 'b':6, 'c':9, 'd':12}
alist = ['a', 'z', 't', 's']
How would I create a new dict with the keys of the first dict and the items of the list, resulting in this?
bdict = {'a': 'a', 'b': 'z', 'c': 't', 'd': 's'}
To pair the keys of adict with the values from alist, use the zip() function.
>>> from collections import OrderedDict
>>> adict = OrderedDict([('a', 3), ('b', 6), ('c', 9), ('d', 12)])
>>> alist = ['a', 'z', 't', 's']
>>> bdict = OrderedDict(zip(adict, alist))
>>> bdict
OrderedDict([('a', 'a'), ('b', 'z'), ('c', 't'), ('d', 's')])
I've used ordered dictionaries here because the question only makes sense if the dictionaries are ordered; otherwise, you can't guarantee the pairwise one-to-one correspondence between adict and alist. (Plain dicts are only guaranteed to preserve insertion order from Python 3.7 onward.)
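On Python 3.7 or later, where plain dicts are guaranteed to keep insertion order, the same pairing can be sketched without OrderedDict. This assumes adict was built in the intended key order.

```python
# Assumes Python 3.7+, where dict iteration follows insertion order.
adict = {'a': 3, 'b': 6, 'c': 9, 'd': 12}
alist = ['a', 'z', 't', 's']

# Iterating a dict yields its keys, so zip pairs each key with a list item.
bdict = dict(zip(adict, alist))
print(bdict)
```

zip() stops at the shorter input, so extra keys or extra list items are silently dropped.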

Matrix file to dictionary in python

I have a file matrix.txt that contains:
  A B C
A 1 2 3
B 4 5 6
C 7 8 9
I want to read the content of the file and store it in a dictionary as follows:
{('A', 'A') : 1, ('A', 'B') : 2, ('A', 'C') : 3,
('B', 'A') : 4, ('B', 'B') : 5, ('B', 'C') : 6,
('C', 'A') : 7, ('C', 'B') : 8, ('C', 'C') : 9}
The following Python 3 function will yield all matrix items with their indices, in a form compatible with the dict constructor:
def read_mx_cells(file, parse_cell=lambda x: x):
    # split each line into whitespace-separated tokens, lazily
    rows = (line.rstrip().split() for line in file)
    header = next(rows)
    for row in rows:
        row_id = row[0]
        for col_id, cell in zip(header, row[1:]):
            yield ((row_id, col_id), parse_cell(cell))
with open('matrix.txt') as f:
    for x in read_mx_cells(f, int):
        print(x)
# (('A', 'A'), 1)
# (('A', 'B'), 2)
# (('A', 'C'), 3) ...
with open('matrix.txt') as f:
    print(dict(read_mx_cells(f, int)))
# {('A', 'A'): 1, ('A', 'B'): 2, ('A', 'C'): 3, ...}
# Note that Python dicts are only guaranteed to retain insertion order from Python 3.7 onward
You can use itertools.product to create the keys from the file header and the first column, which you get by transposing the rows. Zipping the remaining transposed rows again restores their original orientation, and chaining them gives a single iterable of the split substrings. To maintain order we also need to use an OrderedDict:
from collections import OrderedDict
from itertools import izip, product, imap, chain
with open("matrix.txt") as f:
    head, zipped = next(f).split(), izip(*imap(str.split, f))
    cols = next(zipped)
    od = OrderedDict(zip(product(head, cols), chain.from_iterable(izip(*zipped))))
Output:
OrderedDict([(('A', 'A'), '1'), (('A', 'B'), '2'), (('A', 'C'), '3'),
(('B', 'A'), '4'), (('B', 'B'), '5'), (('B', 'C'), '6'), (('C', 'A'), '7'),
(('C', 'B'), '8'), (('C', 'C'), '9')])
For Python 3, just use map and zip in place of imap and izip.
Or without transposing and using the csv lib:
from collections import OrderedDict
from itertools import izip,repeat
import csv
with open("matrix.txt") as f:
    r = csv.reader(f, delimiter=" ", skipinitialspace=1)
    head = repeat(next(r))
    od = OrderedDict((((row[0], k), v) for row in r
                      for k, v in izip(next(head), row[1:])))
The output will be the same.
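For Python 3, the itertools.product idea might be sketched as follows. This version skips the transpose step and takes the row labels directly from each line; the helper name matrix_to_dict is hypothetical, and it reads from a list of lines rather than a file so that it runs on its own.

```python
from itertools import chain, product

def matrix_to_dict(lines, parse_cell=int):
    """Hypothetical helper: `lines` is any iterable of text lines, e.g. an open file."""
    rows = [line.split() for line in lines if line.strip()]
    header = rows[0]                              # column labels
    row_labels = [row[0] for row in rows[1:]]     # first token of each data row
    # all cell values, flattened in row-major order
    cells = chain.from_iterable(row[1:] for row in rows[1:])
    # product(row_labels, header) yields (row, column) pairs in the same
    # row-major order, so zip pairs each key with its value.
    return dict(zip(product(row_labels, header), map(parse_cell, cells)))

sample = ["  A B C", "A 1 2 3", "B 4 5 6", "C 7 8 9"]
result = matrix_to_dict(sample)
print(result)
```

With a real file, the call would be matrix_to_dict(f) inside a with open("matrix.txt") as f: block.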
pandas makes it pretty neat.
import pandas as pd
Approach 1
df = pd.read_table('matrix.txt', sep=' ')
>>> df
A B C
A 1 2 3
B 4 5 6
C 7 8 9
d = df.to_dict()
>>> d
{'A': {'A': 1, 'B': 4, 'C': 7},
'B': {'A': 2, 'B': 5, 'C': 8},
'C': {'A': 3, 'B': 6, 'C': 9}}
new_d = {(r, c): v for c, v1 in d.items() for r, v in v1.items()}
>>> new_d
{('A', 'A'): 1,
('A', 'B'): 2,
('A', 'C'): 3,
('B', 'A'): 4,
('B', 'B'): 5,
('B', 'C'): 6,
('C', 'A'): 7,
('C', 'B'): 8,
('C', 'C'): 9}
Approach 2
df = pd.read_table('matrix.txt', sep=' ')
>>> df
A B C
A 1 2 3
B 4 5 6
C 7 8 9
new_d = {}
for r, v in df.iterrows():
    for c, v1 in v.items():
        new_d[(r, c)] = v1
>>> new_d
{('A', 'A'): 1,
('A', 'B'): 2,
('A', 'C'): 3,
('B', 'A'): 4,
('B', 'B'): 5,
('B', 'C'): 6,
('C', 'A'): 7,
('C', 'B'): 8,
('C', 'C'): 9}

create dictionary from list - in sequence

I would like to create a dictionary from list
>>> list=['a',1,'b',2,'c',3,'d',4]
>>> print list
['a', 1, 'b', 2, 'c', 3, 'd', 4]
I use dict() to produce dictionary from list
but the result is not in sequence as expected.
>>> d = dict(list[i:i+2] for i in range(0, len(list),2))
>>> print d
{'a': 1, 'c': 3, 'b': 2, 'd': 4}
I expect the result to be in sequence as the list.
{'a': 1, 'b': 2, 'c': 3, 'd': 4}
Can you guys please help advise?
Dictionaries don't have a guaranteed order before Python 3.7, so use collections.OrderedDict if you want the order to be preserved. And instead of using indices, use an iterator.
>>> from collections import OrderedDict
>>> lis = ['a', 1, 'b', 2, 'c', 3, 'd', 4]
>>> it = iter(lis)
>>> OrderedDict((k, next(it)) for k in it)
OrderedDict([('a', 1), ('b', 2), ('c', 3), ('d', 4)])
A dictionary is an unordered data structure (in Python versions before 3.7). To preserve order, use collections.OrderedDict:
>>> lst = ['a',1,'b',2,'c',3,'d',4]
>>> from collections import OrderedDict
>>> OrderedDict(lst[i:i+2] for i in range(0, len(lst),2))
OrderedDict([('a', 1), ('b', 2), ('c', 3), ('d', 4)])
You could use the grouper recipe: zip(*[iterable]*n) to collect the items into groups of n:
In [5]: items = ['a',1,'b',2,'c',3,'d',4]
In [6]: items = iter(items)
In [7]: dict(zip(*[items]*2))
Out[7]: {'a': 1, 'b': 2, 'c': 3, 'd': 4}
PS. Never name a variable list, since it shadows the builtin (type) of the same name.
The grouper recipe is easy to use, but a little harder to explain.
Items in a dict are unordered. So if you want the dict items in a certain order, use a collections.OrderedDict (as falsetru already pointed out):
In [13]: collections.OrderedDict(zip(*[items]*2))
Out[13]: OrderedDict([('a', 1), ('b', 2), ('c', 3), ('d', 4)])
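One way to see why the grouper recipe works: [it] * 2 is a list holding the same iterator object twice, so zip alternately pulls the next item from one shared stream. A small sketch with the same data:

```python
items = ['a', 1, 'b', 2, 'c', 3, 'd', 4]

it = iter(items)
# zip(it, it) is what zip(*[it] * 2) expands to: both arguments are the
# same iterator, so each output pair consumes two consecutive items.
pairs = list(zip(it, it))
d = dict(pairs)

# The one-line form from the recipe gives the same result.
same = dict(zip(*[iter(items)] * 2))
print(pairs)
```

If the list has an odd number of items, zip drops the unpaired last one.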
