Count with MongoDB aggregate $group result - python

I have a database query with pymongo, like so:
pipeline = [
    {"$group": {"_id": "$product", "count": {"$sum": 1}}},
]
rows = list(collection_name.aggregate(pipeline))
print(rows)
The result looks like this:
[
    {'_id': 'p1', 'count': 45},
    {'_id': 'p2', 'count': 4},
    {'_id': 'p3', 'count': 96},
    {'_id': 'p1', 'count': 23},
    {'_id': 'p4', 'count': 10}
]
Objective:
Based on the above results, I want to compute statistics over partitions. For example, I want the number of count values that fall into each of the following intervals:
partition, count
(0, 10], 2
[11, 50), 2
[50, 100], 1
Is there a way of doing this entirely within the MongoDB aggregation framework?
Any comments would be very helpful. Thanks.
Answer by Wernfried Domscheit
$bucket
pipeline = [
    {"$group": {"_id": "$product", "count": {"$sum": 1}}},
    {"$bucket": {
        "groupBy": "$count",
        "boundaries": [0, 11, 51, 100],
        "default": "Other",
        "output": {
            "count": {"$sum": 1},
        }
    }}
]
rows = list(collection_name.aggregate(pipeline))
rows
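For the sample counts above (45, 4, 96, 23, 10), the result of this pipeline should look roughly like the sketch below; the _id of each bucket is its lower boundary. This is an illustration of the output shape, not output from the original poster's collection:
[
    {'_id': 0, 'count': 2},    # counts 4 and 10 fall in [0, 11)
    {'_id': 11, 'count': 2},   # counts 45 and 23 fall in [11, 51)
    {'_id': 51, 'count': 1}    # count 96 falls in [51, 100)
]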
$bucketAuto
pipeline = [
    {"$group": {"_id": "$product", "count": {"$sum": 1}}},
    {"$bucketAuto": {
        "groupBy": "$count",
        "buckets": 5,
        "output": {
            "count": {"$sum": 1},
        }
    }}
]
rows = list(collection_name.aggregate(pipeline))
rows
NOTICE:
In $bucket, the default field must be present; without it, any count value that falls outside the boundaries makes the aggregation fail with an error.

Yes, you have the $bucket operator for that:
db.collection.aggregate([
    {
        $bucket: {
            groupBy: "$count",
            boundaries: [0, 11, 51, 100],
            output: {
                count: { $sum: 1 },
            }
        }
    }
])
Or use $bucketAuto where the intervals are generated automatically.
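For reference (not part of the original answer), $bucketAuto reports the computed minimum and maximum of each interval in _id, so each result document comes back shaped roughly like this, with the actual boundaries chosen by the server from the data:
{'_id': {'min': 4, 'max': 23}, 'count': 2}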

Related

How to remove doubles by nested attributes in Python?

I've got a list of records in which the details contain some doubles. In the list of dicts below you can see that the first 3 records (with ids 1, 2 and 3) have the same "count" for all the details with a dir "s" (even though their respective detail ids differ). I would like to remove all records from the root list for which all the counts of the details with a dir "s" are the same as the counts of the details with a dir "s" in a previous record. So from the list below I would want the records with ids 2 and 3 to be removed from the records list.
I've been writing nested loops for a while, but I can't really find a way of doing this. Plus, my code turns into a complete mess very quickly.
What would be a logical and Pythonic way of doing this?
records = [
    {
        'id': 1,
        'details': [
            {"id": 10, "dir": "s", "count": "1"},
            {"id": 20, "dir": "u", "count": "6"},
            {"id": 30, "dir": "s", "count": "1"}
        ]
    },
    {
        'id': 2,
        'details': [
            {"id": 40, "dir": "s", "count": "1"},
            {"id": 50, "dir": "u", "count": "7"},
            {"id": 60, "dir": "s", "count": "1"}
        ]
    },
    {
        'id': 3,
        'details': [
            {"id": 70, "dir": "s", "count": "1"},
            {"id": 80, "dir": "u", "count": "8"},
            {"id": 90, "dir": "s", "count": "1"}
        ]
    },
    {
        'id': 4,
        'details': [
            {"id": 100, "dir": "s", "count": "999"},
            {"id": 110, "dir": "up", "count": "6"},
            {"id": 120, "dir": "s", "count": "999"}
        ]
    },
]
Use a set, with a key based on the two elements of the dict that you consider the definition of a 'duplicate'.
Simple example to uniquify:
seen = set()
for di in records:
    for sdi in di['details']:
        key = (sdi['dir'], sdi['count'])
        if key not in seen:
            seen.add(key)
            print(sdi)
        else:
            # deal with the duplicate?
            pass
Prints:
{'id': 10, 'dir': 's', 'count': '1'}
{'id': 20, 'dir': 'u', 'count': '6'}
{'id': 50, 'dir': 'u', 'count': '7'}
{'id': 80, 'dir': 'u', 'count': '8'}
{'id': 100, 'dir': 's', 'count': '999'}
{'id': 110, 'dir': 'up', 'count': '6'}
Here is a first pass at what I think you mean:
seen = set()
new_rec = []
for di in records:
    new_di = {}
    new_di['id'] = di['id']
    new_li = []
    for sdi in di['details']:
        key = (sdi['dir'], sdi['count'])
        if key not in seen:
            seen.add(key)
            new_li.append(sdi)
        else:
            # deal with the duplicate?
            pass
    new_di['details'] = new_li
    new_rec.append(new_di)
Which results in:
[{'id': 1,
  'details': [{'id': 10, 'dir': 's', 'count': '1'},
              {'id': 20, 'dir': 'u', 'count': '6'}]},
 {'id': 2,
  'details': [{'id': 50, 'dir': 'u', 'count': '7'}]},
 {'id': 3,
  'details': [{'id': 80, 'dir': 'u', 'count': '8'}]},
 {'id': 4,
  'details': [{'id': 100, 'dir': 's', 'count': '999'},
              {'id': 110, 'dir': 'up', 'count': '6'}]}]
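For reference, if the goal is instead to drop whole records whose "s" counts were already seen in an earlier record (so that ids 2 and 3 disappear, as described in the question), a minimal sketch, not part of the original answer, could look like this:
seen = set()
deduped = []
for rec in records:
    # signature: the sorted counts of all details with dir == "s"
    sig = tuple(sorted(d['count'] for d in rec['details'] if d['dir'] == 's'))
    if sig not in seen:
        seen.add(sig)
        deduped.append(rec)
# deduped keeps the records with ids 1 and 4; ids 2 and 3 are dropped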

Merge 2 lists and remove duplicates in Python

I have 2 lists, looking like:
temp_data:
{
"id": 1,
"name": "test (replaced)",
"code": "test",
"last_update": "2020-01-01",
"online": false,
"data": {
"temperature": [
{
"date": "2019-12-17",
"value": 23.652905748126333
},
...
]}
hum_data:
{
"id": 1,
"name": "test (replaced)",
"code": "test",
"last_update": "2020-01-01",
"online": false,
"data": {
"humidity": [
{
"date": "2019-12-17",
"value": 23.652905748126333
},
...
]}
I need to merge the 2 lists into 1 without duplicating data. What is the easiest/most efficient way? After merging, I want something like this:
{
"id": 1,
"name": "test",
"code": "test",
"last_update": "2020-01-01",
"online": false,
"data": {
"temperature": [
{
"date": "2019-12-17",
"value": 23.652905748126333
},
...
],
"humidity": [
{
"date": "2019-12-17",
"value": 23.652905748126333
},
...
Thanks for helping.
If your lists hum_data and temp_data are not sorted, first sort them and then concatenate the dictionaries pair-wise.
# To make comparisons for sorting
compare_function = lambda value: value['id']
# sort the lists first to make the later concatenation easier
temp_data.sort(key=compare_function)
hum_data.sort(key=compare_function)
combined_data = temp_data.copy()
# concatenate the dictionaries using the update function
for hum_row, combined_row in zip(hum_data, combined_data):
    combined_row['data'].update(hum_row['data'])
# combined hum_data and temp_data
combined_data
If the lists are already sorted then you just need to concatenate dictionary by dictionary.
combined_data = temp_data.copy()
# concatenate the dictionaries using the update function
for hum_row, combined_row in zip(hum_data, combined_data):
    combined_row['data'].update(hum_row['data'])
# combined hum_data and temp_data
combined_data
With that code I got the following result:
[
{
'id': 1,
'name': 'test (replaced)',
'code': 'test',
'last_update': '2020-01-01',
'online': False,
'data': {
'temperature': [{'date': '2019-12-17', 'value': 1}],
'humidity': [{'date': '2019-12-17', 'value': 1}]}
},
{
'id': 2,
'name': 'test (replaced)',
'code': 'test',
'last_update': '2020-01-01',
'online': False,
'data': {
'temperature': [{'date': '2019-12-17', 'value': 2}],
'humidity': [{'date': '2019-12-17', 'value': 2}]}
}
]
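If the two lists are not guaranteed to line up pair-wise, a sketch that merges by id instead of by position (assuming id is unique within each list) avoids relying on the sort order:
# Build a lookup keyed by "id", then fold the humidity entries into it.
combined_by_id = {row['id']: {**row, 'data': dict(row['data'])} for row in temp_data}
for hum_row in hum_data:
    entry = combined_by_id.setdefault(hum_row['id'], {**hum_row, 'data': {}})
    entry['data'].update(hum_row['data'])
combined_data = list(combined_by_id.values())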

Python/PySpark parse JSON string with numbered attributes

I need to store JSON strings like the one below in some file format different from plaintext (e.g: parquet):
{
"vidName": "Foo",
"vidInfo.0.size.length": 10,
"vidInfo.0.size.width": 10,
"vidInfo.0.quality": "Good",
"vidInfo.1.size.length": 7,
"vidInfo.1.size.width": 3,
"vidInfo.1.quality": "Bad",
"vidInfo.2.size.length": 10,
"vidInfo.2.size.width": 2,
"vidInfo.2.quality": "Excelent"
}
There's no known bound for the index of vidInfo (it can be 10, 20, ...). Thus I want either to have the vidInfo entries in an array, or to explode such a JSON object into multiple smaller objects.
I found this question: PHP JSON parsing (number attributes?)
But it is in PHP, which I do not quite understand, and I am not sure whether it is the same as what I need.
The intermediate data should be something like this:
{
"vidName": "Foo",
"vidInfo": [
{
"id": 0,
"size": {
"length": 10,
"width": 10
},
"quality": "Good"
},
{
"id": 1,
"size": {
"length": 7,
"width": 3
},
"quality": "Bad"
},
{
"id": 2,
"size": {
"length": 10,
"width": 2
},
"quality": "Excelent"
}
]
}
or like this:
{
"vidName": "Foo",
"vidInfo": [
{
"size": {
"length": 10,
"width": 10
},
"quality": "Good"
},
{
"size": {
"length": 7,
"width": 3
},
"quality": "Bad"
},
{
"size": {
"length": 10,
"width": 2
},
"quality": "Excelent"
}
]
}
I am stuck and would need some hints to move on. Could you please help? Thanks a lot.
I found this library https://github.com/amirziai/flatten which does the trick.
In [154]: some_json = {
...: "vidName": "Foo",
...: "vidInfo.0.size.length": 10,
...: "vidInfo.0.size.width": 10,
...: "vidInfo.0.quality": "Good",
...: "vidInfo.1.size.length": 7,
...: "vidInfo.1.size.width": 3,
...: "vidInfo.1.quality": "Bad",
...: "vidInfo.2.size.length": 10,
...: "vidInfo.2.size.width": 2,
...: "vidInfo.2.quality": "Excelent"
...: }
In [155]: some_json
Out[155]:
{'vidName': 'Foo',
'vidInfo.0.size.length': 10,
'vidInfo.0.size.width': 10,
'vidInfo.0.quality': 'Good',
'vidInfo.1.size.length': 7,
'vidInfo.1.size.width': 3,
'vidInfo.1.quality': 'Bad',
'vidInfo.2.size.length': 10,
'vidInfo.2.size.width': 2,
'vidInfo.2.quality': 'Excelent'}
In [156]: from flatten_json import unflatten_list
...: import json
...: nested_json = unflatten_list(json.loads(json.dumps(some_json)), '.')
In [157]: nested_json
Out[157]:
{'vidInfo': [{'quality': 'Good', 'size': {'length': 10, 'width': 10}},
{'quality': 'Bad', 'size': {'length': 7, 'width': 3}},
{'quality': 'Excelent', 'size': {'length': 10, 'width': 2}}],
'vidName': 'Foo'}
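If you would rather avoid the extra dependency, a minimal pure-Python sketch of the same unflattening idea (a hypothetical helper, assuming every key is either plain or a dot-separated path whose purely numeric segments are list indices, as in the example above) could look like this:
def unflatten(flat, sep='.'):
    # First rebuild nested dicts from the dotted keys.
    nested = {}
    for key, value in flat.items():
        parts = key.split(sep)
        node = nested
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value

    # Then turn dicts whose keys are all digits ("0", "1", ...) into ordered lists.
    def listify(node):
        if isinstance(node, dict):
            node = {k: listify(v) for k, v in node.items()}
            if node and all(k.isdigit() for k in node):
                return [node[k] for k in sorted(node, key=int)]
        return node

    return listify(nested)

nested_json = unflatten(some_json)  # same as the second desired form, without "id"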

Group and sum list of dictionaries by parameter

I have a list of dictionaries of my products (drinks, food, etc.); some of the products may be added several times. I need to group my products by the product_id parameter and sum the product_cost and product_quantity of each group to get the total product price.
I'm a newbie in Python; I understand how to group a list of dictionaries but can't figure out how to sum some of the parameter values.
"products_list": [
{
"product_cost": 25,
"product_id": 1,
"product_name": "Coca-cola",
"product_quantity": 14,
},
{
"product_cost": 176.74,
"product_id": 2,
"product_name": "Apples",
"product_quantity": 800,
},
{
"product_cost": 13,
"product_id": 1,
"product_name": "Coca-cola",
"product_quantity": 7,
}
]
I need to achieve something like this:
"products_list": [
{
"product_cost": 38,
"product_id": 1,
"product_name": "Coca-cola",
"product_quantity": 21,
},
{
"product_cost": 176.74,
"product_id": 2,
"product_name": "Apples",
"product_quantity": 800,
}
]
You can start by sorting the list of dictionaries on product_name, and then group the items on product_name.
Then, for each group, calculate the total cost and total quantity, build the group's dictionary and append it to a list, and finally assemble the result dictionary.
from itertools import groupby
dct = {"products_list": [
{
"product_cost": 25,
"product_id": 1,
"product_name": "Coca-cola",
"product_quantity": 14,
},
{
"product_cost": 176.74,
"product_id": 2,
"product_name": "Apples",
"product_quantity": 800,
},
{
"product_cost": 13,
"product_id": 1,
"product_name": "Coca-cola",
"product_quantity": 7,
}
]}
result = {}
li = []
#Sort product list on product_name
sorted_prod_list = sorted(dct['products_list'], key=lambda x:x['product_name'])
#Group on product_name
for model, group in groupby(sorted_prod_list, key=lambda x: x['product_name']):
    grp = list(group)
    #Compute total cost and qty, make the dictionary and add to list
    total_cost = sum(item['product_cost'] for item in grp)
    total_qty = sum(item['product_quantity'] for item in grp)
    product_name = grp[0]['product_name']
    product_id = grp[0]['product_id']
    li.append({'product_name': product_name, 'product_id': product_id, 'product_cost': total_cost, 'product_quantity': total_qty})
#Make final dictionary
result['products_list'] = li
print(result)
The output will be
{
'products_list': [{
'product_name': 'Apples',
'product_id': 2,
'product_cost': 176.74,
'product_quantity': 800
},
{
'product_name': 'Coca-cola',
'product_id': 1,
'product_cost': 38,
'product_quantity': 21
}
]
}
You can try with pandas:
d = {"products_list": [
{
"product_cost": 25,
"product_id": 1,
"product_name": "Coca-cola",
"product_quantity": 14,
},
{
"product_cost": 176.74,
"product_id": 2,
"product_name": "Apples",
"product_quantity": 800,
},
{
"product_cost": 13,
"product_id": 1,
"product_name": "Coca-cola",
"product_quantity": 7,
}
]}
import pandas as pd

df = pd.DataFrame(d["products_list"])
Pass the dict to pandas and perform a groupby, then convert the result back to a list of dicts with the to_dict function.
result={}
result["products_list"]=df.groupby("product_name",as_index=False).sum().to_dict(orient="records")
Result:
{'products_list': [{'product_cost': 176.74,
'product_id': 2,
'product_name': 'Apples',
'product_quantity': 800},
{'product_cost': 38.0,
'product_id': 2,
'product_name': 'Coca-cola',
'product_quantity': 21}]}
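One caveat with this approach (not mentioned in the original answer): a plain .sum() also adds up product_id, which is why Coca-cola shows product_id 2 above. Grouping on both identifiers and summing only the numeric measures keeps the id intact; a possible variant:
df = pd.DataFrame(d["products_list"])
result = {}
result["products_list"] = (
    df.groupby(["product_id", "product_name"], as_index=False)[["product_cost", "product_quantity"]]
    .sum()
    .to_dict(orient="records")
)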
Personally, I would reorganize it into another dictionary keyed by a unique identifier. If you still need it in list format, you can simply convert dict.values() into a list afterwards. Below is a function that does that.
def get_totals(product_dict):
    totals = {}
    for product in product_dict["products_list"]:
        name = product["product_name"]
        if name not in totals:
            # copy so the input dictionaries are not modified in place
            totals[name] = dict(product)
        else:
            totals[name]["product_cost"] += product["product_cost"]
            totals[name]["product_quantity"] += product["product_quantity"]
    return list(totals.values())
output is:
[
{
'product_cost': 38,
'product_id': 1,
'product_name': 'Coca-cola',
'product_quantity': 21
},
{
'product_cost': 176.74,
'product_id': 2,
'product_name': 'Apples',
'product_quantity': 800
}
]
Now, if you need the result to live under a products_list key, just reassign the list to that key. Instead of returning list(totals.values()), do:
product_dict["products_list"] = list(totals.values())
return product_dict
The output is a dictionary like:
{
"products_list": [
{
"product_cost": 38,
"product_id": 1,
"product_name": "Coca-cola",
"product_quantity": 21,
},
{
"product_cost": 176.74,
"product_id": 2,
"product_name": "Apples",
"product_quantity": 800,
}
]
}

Sum same object in many list in Python

I'm a beginner in Python. How can I sum the counts for the same object id across many lists? I have sample data like this:
data = [
[
{
'id': 1,
'count': 10
},
{
'id': 2,
'count': 20
},
],
[
{
'id': 1,
'count': 20
},
{
'id': 2,
'count': 30
},
]
]
How do I sum the count values for each id, so that I get:
data = [
{
'id': 1,
'count': 30
},
{
'id': 2,
'count': 50
},
]
Try using pandas:
import pandas as pd
df = pd.DataFrame(sum(data, [])) # flatten the data
df = df.groupby('id').sum()
d = [{'id': index, 'count': row['count']} for index, row in df.iterrows()]
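For reference (not part of the original answer), the grouped frame can also be converted straight to the requested list of dicts, assuming the columns are exactly id and count:
df = pd.DataFrame(sum(data, []))  # flatten the data, as above
d = df.groupby('id', as_index=False).sum().to_dict(orient='records')
# d == [{'id': 1, 'count': 30}, {'id': 2, 'count': 50}]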
This isn't the optimal solution, but it works.
data = [
[
{
'id': 1,
'count': 10
},
{
'id': 2,
'count': 20
},
],
[
{
'id': 1,
'count': 20
},
{
'id': 2,
'count': 30
},
]
]
sumofdata = []
doneids = []
for i in data:
    for j in i:
        if j["id"] in doneids:
            for d in sumofdata:
                if d["id"] == j["id"]:
                    d["count"] += j["count"]
                    break
        else:
            doneids.append(j["id"])
            sumofdata.append(dict(j))  # copy so the input data is not modified
print(sumofdata)
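For reference (not from the original answers), the same aggregation can be written more compactly with collections.Counter, assuming every item carries a numeric count:
from collections import Counter
from itertools import chain

totals = Counter()
for item in chain.from_iterable(data):
    totals[item['id']] += item['count']

result = [{'id': i, 'count': c} for i, c in totals.items()]
# result == [{'id': 1, 'count': 30}, {'id': 2, 'count': 50}]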
