Aggregate List of Map in PySpark - python

I have a list of map e.g
[{'a' : 10,'b': 20}, {'a' : 5,'b': 20} , {'b': 20} ,{'a' : 0,'b': 20} ]
I want to get the average of values of a and b. So the expected output is
a = (10 + 5 + 0) / 3 = 5 ;
b = 80/4 = 20.
How can I do it efficiently using an RDD?

The easiest might be map your rdd element to a format like:
init = {'a': {'sum': 0, 'cnt': 0}, 'b': {'sum': 0, 'cnt': 0}}
i.e. record the sum and count for each key, and then reduce it.
Map function:
def map_fun(d, keys=('a', 'b')):
    """Map one record dict to ``{key: {'sum': value, 'cnt': count}}``.

    A key missing from ``d`` contributes ``sum=0, cnt=0`` — a neutral
    element — so absent values do not skew the later average.

    The default for ``keys`` is a tuple rather than a list to avoid the
    shared-mutable-default pitfall; any iterable of keys works.
    """
    map_d = {}
    for k in keys:
        # Present -> record one observation; absent -> neutral element.
        if k in d:
            map_d[k] = {'sum': d[k], 'cnt': 1}
        else:
            map_d[k] = {'sum': 0, 'cnt': 0}
    return map_d
Reduce function:
def reduce_fun(a, b, keys=('a', 'b')):
    """Combine two partial aggregates produced by ``map_fun``.

    Adds the per-key sums and counts; associative and commutative, as
    ``RDD.reduce`` requires. Returns a plain dict (the original used a
    ``defaultdict`` needlessly — both fields are always assigned here).
    The default for ``keys`` is an immutable tuple, not a list.
    """
    reduce_d = {}
    for k in keys:
        reduce_d[k] = {
            'sum': a[k]['sum'] + b[k]['sum'],
            'cnt': a[k]['cnt'] + b[k]['cnt'],
        }
    return reduce_d
rdd.map(map_fun).reduce(reduce_fun)
# defaultdict(<type 'dict'>, {'a': {'sum': 15, 'cnt': 3}, 'b': {'sum': 80, 'cnt': 4}})
Calculate the average:
d = rdd.map(map_fun).reduce(reduce_fun)
{k: v['sum']/v['cnt'] for k, v in d.items()}
{'a': 5, 'b': 20}

Given the structure of your data you should be able to use the dataframe api to achieve this calculation. If you need an rdd it is not too hard to get from the dataframe back to an rdd.
from pyspark.sql import functions as F
df = spark.createDataFrame([{'a' : 10,'b': 20}, {'a' : 5,'b': 20} , {'b': 20} ,{'a' : 0,'b': 20}])
Dataframe looks like this
+----+---+
| a| b|
+----+---+
| 10| 20|
| 5| 20|
|null| 20|
| 0| 20|
+----+---+
Then it follows simply to calculate averages using the pyspark.sql functions
cols = df.columns
df_means = df.agg(*[F.mean(F.col(col)).alias(col+"_mean") for col in cols])
df_means.show()
OUTPUT:
+------+------+
|a_mean|b_mean|
+------+------+
| 5.0| 20.0|
+------+------+

You can use defaultdict to collect similar keys and their values as list.
Then simply aggregate using sum of values divided by number of elements of list for each value.
from collections import defaultdict

# Sample data: list of dicts where some keys may be missing.
x = [{'a' : 10,'b': 20}, {'a' : 5,'b': 20} , {'b': 20} ,{'a' : 0,'b': 20}]

# Collect every observed value per key. The original built this with a
# list comprehension used purely for its side effect; a plain loop is
# the idiomatic form. defaultdict(list) replaces defaultdict(lambda: []).
y = defaultdict(list)
for record in x:
    for k, v in record.items():
        y[k].append(v)

# Average = sum of observed values / number of observations for that key.
# Note: the original was Python 2 (`print k, ...`, integer division);
# under Python 3 true division is used, so averages print as floats.
for k, v in y.items():
    print(k, "=", sum(v) / len(v))
>>> y
defaultdict(<function <lambda> at 0x02A43BB0>, {'a': [10, 5, 0], 'b': [20, 20, 20, 20]})
>>>
>>>
a = 5
b = 20

Related

How to get dictionary of df indices that links the same ids on different days?

I've following toy-dataframe:
| id| date
--------------
0 | a | d1
1 | b | d1
2 | a | d2
3 | c | d2
4 | b | d3
5 | a | d3
import pandas as pd
df = pd.DataFrame({'id': ['a', 'b', 'a', 'c', 'b', 'a'], 'date': ['d1', 'd1', 'd2', 'd2', 'd3', 'd3']})
I want to obtain a 'linking dictionary', like this: d = {0: 2, 2: 5, 1: 4},
where (numbers are just row index)
0:2 means link a from d1 to a from d2,
2:5 means link a from d2 to a from d3,
1:4 means link b from d1 to b from d3
Is there some simple and clean way to get it?
You can use groupby and reduce:
from functools import reduce
d = df.groupby('id').apply(lambda x: dict(zip(x.index, x.index[1:])))
d = reduce(lambda d1, d2: {**d1, **d2}, d) # or reduce(lambda d1, d2: d1 | d2, d)
print(d)
# Output
{0: 2, 2: 5, 1: 4}
Use dictionary comprehension:
d = {k: v for _, x in df.groupby('id') for k, v in zip(x.index, x.index[1:])}
print (d)
{0: 2, 2: 5, 1: 4}

Write struct columns to parquet with pyarrow

I have the following dataframe and schema:
df = pd.DataFrame([[1,2,3], [4,5,6], [7,8,9]], columns=['a', 'b', 'c'])
SCHEMA = pa.schema([("a_and_b", pa.struct([('a', pa.int64()), ('b', pa.int64())])), ('c', pa.int64())])
Then I want to create a pyarrow table from df and save it to parquet with this schema. However, I could not find a way to create a proper type in pandas that would correspond to a struct type in pyarrow. Is there a way to do this?
For pa.struct conversion from pandas you can use tuples (eg: [(1, 4), (2, 5), (3, 6)]):
df_with_tuples = pd.DataFrame({
"a_and_b": zip(df["a"], df["b"]),
"c": df["c"]
})
pa.Table.from_pandas(df_with_tuples, SCHEMA)
or dict [{'a': 1, 'b': 2}, {'a': 4, 'b': 5}, {'a': 7, 'b': 8}]:
df_with_dict = pd.DataFrame({
"a_and_b": df.apply(lambda x: {"a": x["a"], "b": x["b"] }, axis=1),
"c": df["c"]
})
pa.Table.from_pandas(df_with_dict , SCHEMA)
When converting back from arrow to pandas, struct are represented as dict:
pa.Table.from_pandas(df_with_dict , SCHEMA).to_pandas()['a_and_b']
| a_and_b |
|:-----------------|
| {'a': 1, 'b': 2} |
| {'a': 4, 'b': 5} |
| {'a': 7, 'b': 8} |

Dataframe to Dictionary including List of dictionaries

I am trying to convert below dataframe to dictionary.
I want to group via column A and take a list of common sequence. for e.g.
Example 1:
n1 v1 v2
2 A C 3
3 A D 4
4 A C 5
5 A D 6
Expected output:
{'A': [{'C':'3','D':'4'},{'C':'5','D':'6'}]}
Example 2:
n1 n2 v1 v2
s1 A C 3
s1 A D 4
s1 A C 5
s1 A D 6
s1 B P 6
s1 B Q 3
Expected Output:
{'s1': {'A': [{'C': 3, 'D': 4}, {'C': 5, 'D': 6}], 'B': {'P': 6, 'Q': 3}}}
so basically C and D are repeating as a sequence; I want to club C and D into one dictionary and make a list of those dictionaries when the sequence occurs multiple times.
Please note (Currently I am using below code):
def recur_dictify(frame):
    """Recursively convert a DataFrame into nested dicts, keyed by each
    successive column from the left; the last column supplies the leaves.

    NOTE(review): when the final column holds several rows, the leaf is a
    numpy array (via squeeze), not a list — this is exactly the behavior
    the question wants to change.
    """
    # Base case: only one column left — return raw value(s), not a dict.
    if len(frame.columns) == 1:
        # Single cell: return the scalar itself.
        if frame.values.size == 1: return frame.values[0][0]
        # Several rows: collapse the (n, 1) array to 1-D.
        return frame.values.squeeze()
    # Recursive case: group on the leftmost column, recurse on the rest.
    grouped = frame.groupby(frame.columns[0])
    d = {k: recur_dictify(g.iloc[:,1:]) for k,g in grouped}
    return d
This returns :
{s1 : {'A': {'C': array(['3', '5'], dtype=object), 'D': array(['4', '6'], dtype=object),'B':{'E':'5','F':'6'}}
Also, there can be another series of s2 having E,F,G,E,F,G repeating and some X and Y having single values
Lets create a function dictify which create a dictionary with top level keys from name column and club's the repeating occurrences of values in column v1 into different sub dictionaries:
from collections import defaultdict
def dictify(df):
    """Build {n1 value: [dict of v1 -> v2, ...]} where each repeated
    occurrence of a v1 value under the same n1 starts a new sub-dict."""
    # Running occurrence number of each (n1, v1) combination: 0 for the
    # first time a value appears, 1 for the second, and so on.
    occurrence = df.groupby(['n1', 'v1']).cumcount()
    out = {}
    for (name, _), grp in df.groupby(['n1', occurrence]):
        pairs = dict(zip(grp['v1'], grp['v2']))
        out.setdefault(name, []).append(pairs)
    return out
dictify(df)
{'A': [{'C': 3, 'D': 4}, {'C': 5, 'D': 6}]}
UPDATE:
In case there can be variable number of primary grouping keys i.e. [n1, n2, ...] we can use a more generic method:
def update(dct, keys, val):
    """Set ``val`` at the nested key path ``keys`` inside ``dct``.

    If the leaf key already exists, the old value(s) and ``val`` are
    merged into a flat list (np.hstack flattens an existing list and the
    new entry into one object array, then it is unpacked back to a list).
    Returns ``dct`` to support the recursive assignment below.
    """
    # Split off the first key; `_` holds the remaining path (may be empty).
    k, *_ = keys
    # Three cases: recurse while path remains; merge into a list when the
    # leaf key is already present; otherwise plain assignment.
    dct[k] = update(dct.get(k, {}), _, val) if _ \
    else [*np.hstack([dct[k], [val]])] if k in dct else val
    return dct
def dictify(df, keys):
    """Nest df under the grouping columns in ``keys``; repeated v1 values
    within a group are clubbed into successive {v1: v2} sub-dicts."""
    result = {}
    for group_key, sub in df.groupby(keys):
        # Occurrence counter restarts per group: 0 for first sighting of a
        # v1 value, 1 for the second, ... — one sub-dict per occurrence.
        repeat_no = sub.groupby('v1').cumcount()
        for _, chunk in sub.groupby(repeat_no):
            update(result, group_key, dict(zip(chunk['v1'], chunk['v2'])))
    return dict(result)
dictify(df, ['n1', 'n2'])
{'s1': {'A': [{'C': 3, 'D': 4}, {'C': 5, 'D': 6}], 'B': {'P': 6, 'Q': 3}}}
Here is a simple one line statement that solves your problem:
def df_to_dict(df):
    """Group rows by 'name' and pair consecutive rows into v1 -> v2 dicts.

    Returns {name: [ {v1: v2, ...} per pair of rows ]}.

    Fix: ``d.drop('name', 1)`` passed the axis positionally, which was
    deprecated and then removed in pandas 2.0 — use ``columns=`` instead.

    NOTE(review): ``d.index // 2`` assumes a default integer index and
    that each repeating sequence is exactly two rows long — confirm for
    other inputs.
    """
    return {name: [dict(x.to_dict('split')['data'])
                   for _, x in d.drop(columns='name').groupby(d.index // 2)]
            for name, d in df.groupby('name')}
Here is an example:
df = pd.DataFrame({'name': ['A'] * 4,
'v1': ['C', 'D'] * 2,
'v2': [3, 4, 5, 6]})
print(df_to_dict(df))
Output:
{'A': [{'C': 3, 'D': 4}, {'C': 5, 'D': 6}]}

Add values from two dictionaries

dict1 = {a: 5, b: 7}
dict2 = {a: 3, c: 1}
result {a:8, b:7, c:1}
How can I get the result?
this is a one-liner that would do just that:
dict1 = {'a': 5, 'b': 7}
dict2 = {'a': 3, 'c': 1}
result = {key: dict1.get(key, 0) + dict2.get(key, 0)
for key in set(dict1) | set(dict2)}
# {'c': 1, 'b': 7, 'a': 8}
note that set(dict1) | set(dict2) is the set of the keys of both your dictionaries. and dict1.get(key, 0) returns dict1[key] if the key exists, 0 otherwise.
this works on a more recent python version:
{k: dict1.get(k, 0) + dict2.get(k, 0) for k in dict1.keys() | dict2.keys()}
You can use collections.Counter which implements addition + that way:
>>> from collections import Counter
>>> dict1 = Counter({'a': 5, 'b': 7})
>>> dict2 = Counter({'a': 3, 'c': 1})
>>> dict1 + dict2
Counter({'a': 8, 'b': 7, 'c': 1})
if you really want the result as dict you can cast it back afterwards:
>>> dict(dict1 + dict2)
{'a': 8, 'b': 7, 'c': 1}
Here is a nice function for you:
def merge_dictionaries(dict1, dict2):
    """Merge two dicts into a new one, adding the values of shared keys.

    Keys unique to either input are copied through unchanged; neither
    input dict is modified.
    """
    merged = {}
    # Keys from dict1: add dict2's value when the key is shared.
    for key, value in dict1.items():
        merged[key] = value + dict2[key] if key in dict2 else value
    # Keys only in dict2: fill in whatever is not yet present.
    for key, value in dict2.items():
        merged.setdefault(key, value)
    return merged
by writing:
dict1 = {'a': 5, 'b': 7}
dict2 = {'a': 3, 'c': 1}
result = merge_dictionaries(dict1, dict2)
result will be:
{'a': 8, 'b': 7, 'c': 1}
A quick dictionary comprehension that should work on any classes which accept the + operator. Performance might not be optimal.
{
**dict1,
**{ k:(dict1[k]+v if k in dict1 else v)
for k,v in dict2.items() }
}
Here is another approach but it is quite lengthy!
d1 = {'a': 5, 'b': 7}
d2 = {'a': 3, 'c': 1}

# Result dictionary, built in three passes exactly like the original:
# shared keys first (summed), then leftovers from each input.
d = {}

# Pass 1: pairwise scan for keys present in both dicts; sum their values.
for left_key, left_val in d1.items():
    for right_key, right_val in d2.items():
        if left_key == right_key:
            d[left_key] = left_val + right_val

# Pass 2: keys that exist only in d1.
for left_key, left_val in d1.items():
    if left_key not in d:
        d[left_key] = left_val

# Pass 3: keys that exist only in d2.
for right_key, right_val in d2.items():
    if right_key not in d:
        d[right_key] = right_val
Think it's much simpler.
a={'a':3, 'b':5}
b= {'a':4, 'b':7}
{i:a[i]+b[i] for i in a.keys()}
Output: {'a': 7, 'b': 12}

Updating a dictionary in python

I've been stuck on this question for quite sometime and just can't figure it out. I just want to be able to understand what I'm missing and why it's needed.
What I need to do is make a function which adds each given key/value pair to the dictionary. The argument key_value_pairs will be a list of tuples in the form (key, value).
def add_to_dict(d, key_value_pairs):
    """Insert each (key, value) pair into d; per the expected examples it
    should also return the (key, old_value) pairs that were overwritten.

    BUG(review): this is the question's broken version — see the notes
    inline; the accepted answers below show the fixes.
    """
    newinputs = [] #creates new list
    for key, value in key_value_pairs:
        # NOTE(review): the old value must be captured BEFORE this
        # assignment — here it is already destroyed.
        d[key] = value #updates element of key with value
        # BUG(review): `key` is a bare key but key_value_pairs holds
        # (key, value) tuples, so for these inputs the test is never True
        # and newinputs stays empty; the check should be `key in d`
        # performed before the assignment above.
        if key in key_value_pairs:
            newinputs.append((d[key], value)) #adds d[key and value to list
    return newinputs
I can't figure out how to update the "value" variable when d and key_value_pairs have different keys.
The first three of these scenarios work but the rest fail
>>> d = {}
>>> add_to_dict(d, [])
[]
>>> d
{}
>>> d = {}
>>> add_to_dict(d, [('a', 2)])
[]
>>> d
{'a': 2}
>>> d = {'b': 4}
>>> add_to_dict(d, [('a', 2)])
[]
>>> d
{'a':2, 'b':4}
>>> d = {'a': 0}
>>> add_to_dict(d, [('a', 2)])
[('a', 0)]
>>> d
{'a':2}
>>> d = {'a': 0, 'b': 1}
>>> add_to_dict(d, [('a', 2), ('b', 4)])
[('a', 2), ('b', 1)]
>>> d
{'a': 2, 'b': 4}
>>> d = {'a': 0}
>>> add_to_dict(d, [('a', 1), ('a', 2)])
[('a', 0), ('a', 1)]
>>> d
{'a': 2}
Thanks
Edited.
Python has this feature built-in:
>>> d = {'b': 4}
>>> d.update({'a': 2})
>>> d
{'a': 2, 'b': 4}
Or given you're not allowed to use dict.update:
>>> d = dict(d.items() + {'a': 2}.items()) # doesn't work in python 3
With python 3.9 you can use an |= update operator:
>>> d = {'b': 4}
>>> d |= {'a': 2}
>>> d
{'a': 2, 'b': 4}
Here's a more elegant solution, compared to Eric's 2nd snippet
>>> a = {'a' : 1, 'b' : 2}
>>> b = {'a' : 2, 'c' : 3}
>>> c = dict(a, **b)
>>> a
{'a': 1, 'b': 2}
>>> b
{'a': 2, 'c': 3}
>>> c
{'a': 2, 'b': 2, 'c': 3}
It works both in Python 2 and 3
And of course, the update method
>>> a
{'a': 1, 'b': 2}
>>> b
{'a': 2, 'c': 3}
>>> a.update(b)
>>> a
{'a': 2, 'b': 2, 'c': 3}
However, be careful with the latter, as might cause you issues in case of misuse like here
>>> a = {'a' : 1, 'b' : 2}
>>> b = {'a' : 2, 'c' : 3}
>>> c = a
>>> c.update(b)
>>> a
{'a': 2, 'b': 2, 'c': 3}
>>> b
{'a': 2, 'c': 3}
>>> c
{'a': 2, 'b': 2, 'c': 3}
The new version of Python3.9 introduces two new operators for dictionaries: union (|) and in-place union (|=). You can use | to merge two dictionaries, while |= will update a dictionary in place. Let's consider 2 dictionaries d1 and d2
d1 = {"name": "Arun", "height": 170}
d2 = {"age": 21, "height": 170}
d3 = d1 | d2 # d3 is the union of d1 and d2
print(d3)
Output:
{'name': 'Arun', 'height': 170, 'age': 21}
Update d1 with d2
d1 |= d2
print(d1)
Output:
{'name': 'Arun', 'height': 170, 'age': 21}
You can update d1 with a new key weight as
d1 |= {"weight": 80}
print(d1)
Output:
{'name': 'Arun', 'height': 170, 'age': 21, 'weight': 80}
So if I understand you correctly you want to return a list of of tuples with (key, old_value) for the keys that were replaced.
You have to save the old value before you replace it:
def add_to_dict(d, key_value_pairs):
    """Apply every (key, value) pair to d, returning the (key, old_value)
    pairs for keys that already existed and were therefore replaced."""
    replaced = []
    for key, new_value in key_value_pairs:
        # Capture the old value before it is overwritten below.
        try:
            replaced.append((key, d[key]))
        except KeyError:
            pass  # brand-new key: nothing was replaced
        d[key] = new_value
    return replaced
Each key in a python dict corresponds to exactly one value. The cases where d and key_value_pairs have different keys are not the same elements.
Is newinputs supposed to contain the key/value pairs that were previously not present in d? If so:
def add_to_dict(d, key_value_pairs):
    """Apply every (key, value) pair to d, returning the pairs whose key
    was NOT previously present (i.e. genuinely new entries)."""
    added = []
    for key, value in key_value_pairs:
        # Record presence before the write so the pair is classified
        # against the dict's state at this point in the iteration.
        already_present = key in d
        d[key] = value
        if not already_present:
            added.append((key, value))
    return added
Is newinputs supposed to contain the key/value pairs where the key was present in d and then changed? If so:
def add_to_dict(d, key_value_pairs):
    """Apply every (key, value) pair to d, returning the (key, new_value)
    pairs for keys that already existed and were changed."""
    overwritten = []
    for key, value in key_value_pairs:
        # Check presence before the write: pairs are classified against
        # the dict's state at this point in the iteration.
        was_known = key in d
        d[key] = value
        if was_known:
            overwritten.append((key, value))
    return overwritten
If I understand you correctly, you only want to add the keys that do not exist in the dictionary. Here is the code:
def add_to_dict(d, key_value_pairs):
    """Add only the pairs whose key is not already present in d.

    Existing keys are left untouched. Returns the list of (key, value)
    pairs that were actually inserted.
    """
    newinputs = []
    for key, value in key_value_pairs:
        # Membership test directly on the dict (idiomatic and O(1));
        # the original used `key not in d.keys()` and stray semicolons.
        if key not in d:
            d[key] = value
            newinputs.append((key, value))
    return newinputs
For each key in new key,value pairs list you have to check if the key is new to the dictionary and add it only then.
Hope it helps ;)

Categories

Resources