Summing up collections.Counter objects using `groupby` in pandas - python

I am trying to group the words_count column by both essay_Set and domain1_score and adding the counters in words_count to add the counters results as mentioned here:
>>> c = Counter(a=3, b=1)
>>> d = Counter(a=1, b=2)
>>> c + d # add two counters together: c[x] + d[x]
Counter({'a': 4, 'b': 3})
I grouped them using this command:
words_freq_by_set = words_freq_by_set.groupby(by=["essay_set", "domain1_score"]) but do not know how to pass the Counter addition function to apply it on words_count column which is simply +.
Here is my dataframe:

GroupBy.sum works with Counter objects. However I should mention the process is pairwise, so this may not be very fast. Let's try
words_freq_by_set.groupby(by=["essay_set", "domain1_score"])['words_count'].sum()
df = pd.DataFrame({
'a': [1, 1, 2],
'b': [Counter([1, 2]), Counter([1, 3]), Counter([2, 3])]
})
df
a b
0 1 {1: 1, 2: 1}
1 1 {1: 1, 3: 1}
2 2 {2: 1, 3: 1}
df.groupby(by=['a'])['b'].sum()
a
1 {1: 2, 2: 1, 3: 1}
2 {2: 1, 3: 1}
Name: b, dtype: object

Related

Python find frequency of numbers in list of lists

I have a list of lists of int as shown below
[[1, 2, 3],
[1, 5],
[4, 2, 6]]
I want to generate the frequency of the numbers in the lists as a dict, for example 1 occurs in 2 of the lists, and so on, expected output is
{1:2,
2:2,
3:1,
4:1,
5:1,
6:1}
How can this be generated?
You could try this:
L is your list of list.
expected = {1:2,
2:2,
3:1,
4:1,
5:1,
6:1}
>>> from itertools import chain
>>> from collections import Counter
>>> flattened = list(chain.from_iterable(L))
>>> flattened
[1, 2, 3, 1, 5, 4, 2, 6]
>>> counts = Counter(flattened)
>>> counts
Counter({1: 2, 2: 2, 3: 1, 5: 1, 4: 1, 6: 1})
# It's easy to make it to a function or one-liner too.
>>> counts = Counter(chain.from_iterable(L))
>>> assert counts == expected # your expected result shown above
# silence means matching.
you can use Counter for this
>>> from collections import Counter as c
>>> array = [[1, 2, 3],[1,5],[4,2,6]]
>>> result = c()
>>> for sublist in array:
... result += c(sublist)
...
>>> result
Counter({1: 2, 2: 2, 3: 1, 5: 1, 4: 1, 6: 1})

Initializing a dictionary with zeroes [duplicate]

This question already has answers here:
How do I count the occurrences of a list item?
(30 answers)
Closed 3 years ago.
I am trying to track seen elements, from a big array, using a dict.
Is there a way to force a dictionary object to be integer type and set to zero by default upon initialization?
I have done this with a very clunky codes and two loops.
Here is what I do now:
fl = [0, 1, 1, 2, 1, 3, 4]
seenit = {}
for val in fl:
seenit[val] = 0
for val in fl:
seenit[val] = seenit[val] + 1
Of course, just use collections.defaultdict([default_factory[, ...]]):
from collections import defaultdict
fl = [0, 1, 1, 2, 1, 3, 4]
seenit = defaultdict(int)
for val in fl:
seenit[val] += 1
print(fl)
# Output
defaultdict(<class 'int'>, {0: 1, 1: 3, 2: 1, 3: 1, 4: 1})
print(dict(seenit))
# Output
{0: 1, 1: 3, 2: 1, 3: 1, 4: 1}
In addition, if you don't like to import collections you can use dict.get(key[, default])
fl = [0, 1, 1, 2, 1, 3, 4]
seenit = {}
for val in fl:
seenit[val] = seenit.get(val, 0) + 1
print(seenit)
# Output
{0: 1, 1: 3, 2: 1, 3: 1, 4: 1}
Also, if you only want to solve the problem and don't mind to use exactly dictionaries you may use collection.counter([iterable-or-mapping]):
from collections import Counter
fl = [0, 1, 1, 2, 1, 3, 4]
seenit = Counter(f)
print(seenit)
# Output
Counter({1: 3, 0: 1, 2: 1, 3: 1, 4: 1})
print(dict(seenit))
# Output
{0: 1, 1: 3, 2: 1, 3: 1, 4: 1}
Both collection.defaultdict and collection.Counter can be read as dictionary[key] and supports the usage of .keys(), .values(), .items(), etc. Basically they are a subclass of a common dictionary.
If you want to talk about performance I checked with timeit.timeit() the creation of the dictionary and the loop for a million of executions:
collection.defaultdic: 2.160868141 seconds
dict.get: 1.3540439499999999 seconds
collection.Counter: 4.700308418999999 seconds
collection.Counter may be easier, but much slower.
You can use collections.Counter:
from collections import Counter
Counter([0, 1, 1, 2, 1, 3, 4])
Output:
Counter({1: 3, 0: 1, 2: 1, 3: 1, 4: 1})
You can then address it like a dictionary:
>>> Counter({1: 3, 0: 1, 2: 1, 3: 1, 4: 1})[1]
3
>>> Counter({1: 3, 0: 1, 2: 1, 3: 1, 4: 1})[0]
1
Using val in seenit is a bit faster than .get():
seenit = dict()
for val in fl:
if val in seenit :
seenit[val] += 1
else:
seenit[val] = 1
For larger lists, Counter will eventually outperform all other approaches. and defaultdict is going to be faster than using .get() or val in seenit.

nested dictionary of bin sizes from groupby multiple columns

df = pd.DataFrame({'a': [1,1,1,1,2,2,2,2,3,3,3,3], 'b': [5,5,1,1,3,3,3,1,2,1,1,1,]})
>>> df
a b
0 1 5
1 1 5
2 1 1
3 1 1
4 2 3
5 2 3
6 2 3
7 2 1
8 3 2
9 3 1
10 3 1
11 3 1
>>> df.groupby(['a','b']).size().to_dict()
{(1, 5): 2, (3, 2): 1, (2, 3): 3, (3, 1): 3, (1, 1): 2, (2, 1): 1}
What I am getting is the counts of each a and b combination with a tuple of the pair as key but what I am trying to get to is:
{1: {5: 2, 1: 2}, 2: {3: 3, 1: 1}, 3: {2: 1, 1: 3} }
You'll need an additional groupby inside a dict comprehension:
i = df.groupby(['a','b']).size().reset_index(level=1)
j = {k : dict(g.values) for k, g in i.groupby(level=0)}
print(j)
{
1: {1: 2, 5: 2},
2: {1: 1, 3: 3},
3: {1: 3, 2: 1}
}
You can use collections.defaultdict for an O(n) solution.
from collections import defaultdict
df = pd.DataFrame({'a': [1,1,1,1,2,2,2,2,3,3,3,3], 'b': [5,5,1,1,3,3,3,1,2,1,1,1,]})**Option 2: defaultdict**
d = defaultdict(lambda: defaultdict(int))
for i, j in map(tuple, df.values):
d[i][j] += 1
# defaultdict(<function __main__.<lambda>>,
# {1: defaultdict(int, {1: 2, 5: 2}),
# 2: defaultdict(int, {1: 1, 3: 3}),
# 3: defaultdict(int, {1: 3, 2: 1})})
from collections import Counter
import pandas as pd
s = pd.Series(Counter(zip(df.a, df.b)))
{
n: d.xs(n).to_dict()
for n, d in s.groupby(level=0)
}
{1: {1: 2, 5: 2}, 2: {1: 1, 3: 3}, 3: {1: 3, 2: 1}}

Appending to dictionary stored in dataframe on value

So I have a DataFrame that looks like the following
a b c
0 AB 10 {a: 2, b: 1}
1 AB 1 {a: 3, b: 2}
2 AC 2 {a: 4, b: 3}
...
400 BC 4 {a: 1, b: 4}
Given another key pair like {c: 2} what's the syntax to add this to every value in row c?
a b c
0 AB 10 {a: 2, b: 1, c: 2}
1 AB 1 {a: 3, b: 2, c: 2}
2 AC 2 {a: 4, b: 3, c: 2}
...
400 BC 4 {a: 1, b: 4, c: 2}
I've tried df['C'] +=, and df['C'].append(), and df.C.append, but neither seem to work.
Here is a generalized way for updating dictionaries in a column with another dictionary, which can be used for multiple keys.
Test dataframe:
>>> x = pd.Series([{'a':2,'b':1}])
>>> df = pd.DataFrame(x, columns=['c'])
>>> df
c
0 {'b': 1, 'a': 2}
And just apply a lambda function:
>>> update_dict = {'c': 2}
>>> df['c'].apply(lambda x: {**x, **update_dict})
0 {'b': 1, 'a': 2, 'c': 2}
Name: c, dtype: object
Note: this uses the Python3 update dictionary syntax mentioned in an answer to How to merge two Python dictionaries in a single expression?. For Python2, you can use the merge_two_dicts function in the top answer. You can use the function definition from that answer and then write:
df['c'].apply(lambda x: merge_two_dicts(x, update_dict))

How to keep my original dict when doing addition of Counter

To my understanding, I know when I invoke Counter to covert dict. This dict includes value of keys is zero will disappear.
from collections import Counter
a = {"a": 1, "b": 5, "d": 0}
b = {"b": 1, "c": 2}
print Counter(a) + Counter(b)
If I want to keep my keys, how to do?
This is my expected result:
Counter({'b': 6, 'c': 2, 'a': 1, 'd': 0})
You can also use the update() method of Counter instead of + operator, example -
>>> a = {"a": 1, "b": 5, "d": 0}
>>> b = {"b": 1, "c": 2}
>>> x = Counter(a)
>>> x.update(Counter(b))
>>> x
Counter({'b': 6, 'c': 2, 'a': 1, 'd': 0})
update() function adds counts instead of replacing them , and it does not remove the zero value one either. We can also do Counter(b) first, then update with Counter(a), Example -
>>> y = Counter(b)
>>> y.update(Counter(a))
>>> y
Counter({'b': 6, 'c': 2, 'a': 1, 'd': 0})
Unfortunately, when summing two counter, only elements with a positive count are used.
If you want to keep the elements with a count of zero, you could define a function like this:
def addall(a, b):
c = Counter(a) # copy the counter a, preserving the zero elements
for x in b: # for each key in the other counter
c[x] += b[x] # add the value in the other counter to the first
return c
You can just subclass Counter and adjust its __add__ method:
from collections import Counter
class MyCounter(Counter):
def __add__(self, other):
"""Add counts from two counters.
Preserves counts with zero values.
>>> MyCounter('abbb') + MyCounter('bcc')
MyCounter({'b': 4, 'c': 2, 'a': 1})
>>> MyCounter({'a': 1, 'b': 0}) + MyCounter({'a': 2, 'c': 3})
MyCounter({'a': 3, 'c': 3, 'b': 0})
"""
if not isinstance(other, Counter):
return NotImplemented
result = MyCounter()
for elem, count in self.items():
newcount = count + other[elem]
result[elem] = newcount
for elem, count in other.items():
if elem not in self:
result[elem] = count
return result
counter1 = MyCounter({'a': 1, 'b': 0})
counter2 = MyCounter({'a': 2, 'c': 3})
print(counter1 + counter2) # MyCounter({'a': 3, 'c': 3, 'b': 0})
I help Anand S Kumar to do more a additional explanation.
Even though your dict includes negative value, it still keep your keys.
from collections import Counter
a = {"a": 1, "b": 5, "d": -1}
b = {"b": 1, "c": 2}
print Counter(a) + Counter(b)
#Counter({'b': 6, 'c': 2, 'a': 1})
x = Counter(a)
x.update(Counter(b))
print x
#Counter({'b': 6, 'c': 2, 'a': 1, 'd': -1})

Categories

Resources