I have data in MongoDB that looks like this:
{
"good": {
"d1": 2,
"d2": 56,
"d3": 3
},
"school": {
"d1": 4,
"d3": 5,
"d4": 12
}
},
{
"good": {
"d5": 4,
"d6": 5
},
"spark": {
"d5": 6,
"d6": 11,
"d7": 10
},
"school": {
"d5": 8,
"d8": 7
}
}
and I want to use pymongo mapreduce to generate data like this:
{
'word': 'good',
'info': [
{
'tbl_id': 'd1',
'term_freq': 2
},
{
'tbl_id': 'd2',
'term_freq': 56
},
{
'tbl_id': 'd3',
'term_freq': 3
},
{
'tbl_id': 'd5',
'term_freq': 4
},
{
'tbl_id': 'd6',
'term_freq': 5
}
]
}
{
'word': 'school',
'info': [
{
'tbl_id': 'd1',
'term_freq': 4
},
{
'tbl_id': 'd3',
'term_freq': 5
},
{
'tbl_id': 'd4',
'term_freq': 12
},
{
'tbl_id': 'd5',
'term_freq': 8
},
{
'tbl_id': 'd8',
'term_freq': 7
}
]
}
{
'word': 'spark',
'info': [
{
'tbl_id': 'd5',
'term_freq': 6
},
{
'tbl_id': 'd6',
'term_freq': 11
},
{
'tbl_id': 'd7',
'term_freq': 10
}
]
}
What should I do? Or is there another solution?
You don't need `mapReduce` here; the aggregation framework handles this beautifully.
As for how this works, I suggest you look up each operator used below ($objectToArray, $filter, $map, $unwind, $group, $reduce, $concatArrays) in the documentation.
_filter = {
"input": {"$objectToArray": "$$ROOT"},
"cond": {"$ne": ["$$this.k", "_id"]}
}
_map = {
"$map": {
"input": {"$filter": _filter},
"in": {
"k": "$$this.k",
"info": {
"$map": {
"input": {"$objectToArray": "$$this.v"},
"in": {"tbl_id": "$$this.k", "freq_term": "$$this.v"}
}
}
}
}
}
pipeline = [
{"$project": {"word": _map}},
{"$unwind": "$word"},
{
"$group": {
"_id": "$word.k",
"info": {
"$push": "$word.info"
}
}
},
{
"$project": {
"_id": 0,
"word": "$_id",
"info": {
"$reduce": {
"input": "$info",
"initialValue": [
],
"in": {
"$concatArrays": [
"$$value",
"$$this"
]
}
}
}
}
}
]
Then run with the .aggregate() method.
collection.aggregate(pipeline)
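For completeness, a minimal end-to-end sketch with pymongo (the connection URI, database, and collection names are placeholders):
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder URI
collection = client["testdb"]["term_freqs"]        # placeholder names

# `pipeline` is the list defined above
for doc in collection.aggregate(pipeline):
    print(doc)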
I am filtering in Elasticsearch. I want doc_count to return 0 for dates with no data, but those dates aren't returned at all; only dates with data come back. Do you know how I can do that? Here is the Python output:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
...
33479 {'date': '2022-04-13T08:08:00.000Z', 'value': 7}
33480 {'date': '2022-04-13T08:08:00.000Z', 'value': 7}
33481 {'date': '2022-04-13T08:08:00.000Z', 'value': 7}
33482 {'date': '2022-04-13T08:08:00.000Z', 'value': 7}
33483 {'date': '2022-04-13T08:08:00.000Z', 'value': 7}
And here is my Elasticsearch filter:
{
"from": 0,
"size": 0,
"query": {
"bool": {
"must":
[
{
"range": {
"#timestamp": {
"gte": "now-1M",
"lt": "now"
}
}
}
]
}
},
"aggs": {
"continent": {
"terms": {
"field": "source.geo.continent_name.keyword"
},
"aggs": {
"_source": {
"date_histogram": {
"field": "#timestamp", "interval": "8m"
}}}}}}
You need to set the min_doc_count value to 0 on each aggregation where you want results with a zero doc_count.
{
"from": 0,
"size": 0,
"query": {
"bool": {
"must": [
{
"range": {
"#timestamp": {
"gte": "now-1M",
"lt": "now"
}
}
}
]
}
},
"aggs": {
"continent": {
"terms": {
"field": "source.geo.continent_name.keyword",
"min_doc_count": 0
},
"aggs": {
"_source": {
"date_histogram": {
"field": "#timestamp",
"interval": "8m",
"min_doc_count": 0
}
}
}
}
}
}
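If you are issuing the query from Python, a minimal sketch with the official elasticsearch client might look like the following (the host and index name are placeholders, and in elasticsearch-py 8.x you may need to pass the request body via keyword arguments instead of body=):
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder host

query = {
    "from": 0,
    "size": 0,
    "query": {"bool": {"must": [
        {"range": {"#timestamp": {"gte": "now-1M", "lt": "now"}}}
    ]}},
    "aggs": {"continent": {
        "terms": {"field": "source.geo.continent_name.keyword", "min_doc_count": 0},
        "aggs": {"_source": {"date_histogram": {
            "field": "#timestamp", "interval": "8m", "min_doc_count": 0
        }}},
    }},
}

resp = es.search(index="my-index", body=query)  # index name is a placeholder
for bucket in resp["aggregations"]["continent"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])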
Let's say I have a collection like the following. For every document that contains animals.horse, I want to set animals.goat equal to animals.horse (so the horses don't get lonely or outnumbered).
[
{
"_id": 1,
"animals": {
"goat": 1
}
},
{
"_id": 2,
"animals": {
"cow": 1,
"horse": 2,
"goat": 1
}
},
{
"_id": 3,
"animals": {
"horse": 5
}
},
{
"_id": 4,
"animals": {
"cow": 1
}
}
]
In Mongo shell, this works as desired:
db.collection.update(
{"animals.horse": { "$gt": 0 }},
[ { "$set": { "animals.goat": "$animals.horse" } } ],
{ "multi": true }
)
which achieves the desired result:
[
{
"_id": 1,
"animals": {
"goat": 1
}
},
{
"_id": 2,
"animals": {
"cow": 1,
"goat": 2,
"horse": 2
}
},
{
"_id": 3,
"animals": {
"goat": 5,
"horse": 5
}
},
{
"_id": 4,
"animals": {
"cow": 1
}
}
]
However, this doesn't work in pymongo -- the collection is unaltered.
db.collection.update_many( filter = {'animals.horse': {'$gt':0} },
update = [ {'$set': {'animals.goat': '$animals.horse' } } ],
upsert = True
)
What am I doing wrong?
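One possibility worth ruling out (my own suggestion; the thread leaves this unanswered): aggregation-pipeline updates like this were only added in MongoDB server 4.2 and pymongo 3.9, so an older driver or server would explain the shell working while pymongo does not. A quick sketch to check both versions:
import pymongo
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder URI
# aggregation-pipeline updates need pymongo >= 3.9 and MongoDB >= 4.2
print("pymongo version:", pymongo.version)
print("server version:", client.server_info()["version"])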
The idea is to compare N dictionaries against a single standard dictionary, where each key-value pair comparison has a different conditional rule. For example:
Standard dictionary -
{'ram': 16,
'storage': [512, 1, 2],
'manufacturers': ['Dell', 'Apple', 'Asus', 'Alienware'],
'year': 2018,
'drives': ['A', 'B', 'C', 'D', 'E']
}
List of dictionaries -
{'ram': 8,
'storage': 1,
'manufacturers': 'Apple',
'year': 2018,
'drives': ['C', 'D', 'E']
},
{'ram': 16,
'storage': 4,
'manufacturers': 'Asus',
'year': 2021,
'drives': ['F', 'G','H']
},
{'ram': 4,
'storage': 2,
'manufacturers': 'ACER',
'year': 2016,
'drives': ['F', 'G', 'H']
}
Conditions:
'ram' > 8
if 'ram' >= 8 then 'storage' >= 2 else 1
'manufacturers' in ['Dell', 'Apple', 'Asus', 'Alienware']
'year' >= 2018
if 'year' > 2018 then 'drives' in ['A', 'B', 'C', 'D', 'E'] else ['F', 'G', 'H']
So the expected output should show, for each dictionary, the non-matching values as-is and None/null for the matching values.
Expected Output -
{'ram': 8,
'storage': 1,
'manufacturers': None,
'year': None,
'drives': ['C', 'D', 'E']
},
{'ram': None,
'storage': None,
'manufacturers': None,
'year': None,
'drives': ['F','G','H']
},
{'ram': 4,
'storage': 2,
'manufacturers': 'ACER',
'year': 2016,
'drives': None
}
I encountered this problem while working with MongoDB, where each document in a data collection should be compared with a standard collection. A direct MongoDB query would also be very helpful.
To apply these conditions using MongoDB aggregation alone, use the query below:
db.collection.aggregate([
{
"$project": {
"ram": {
"$cond": {
"if": {
"$gt": [
"$ram",
8
]
},
"then": null,
"else": "$ram",
}
},
"storage": {
"$cond": {
"if": {
"$and": [
{
"$gte": [
"$ram",
8
]
},
{
"$gte": [
"$storage",
2
]
},
],
},
"then": null,
"else": "$storage",
}
},
"manufacturers": {
"$cond": {
"if": {
"$in": [
"$manufacturers",
[
"Dell",
"Apple",
"Asus",
"Alienware"
],
]
},
"then": null,
"else": "$manufacturers",
}
},
"year": {
"$cond": {
"if": {
"$gte": [
"$year",
2018
]
},
"then": null,
"else": "$year",
}
},
"drives": {
"$cond": {
"if": {
"$gt": [
"$year",
2018
]
},
"then": {
"$setIntersection": [
"$drives",
[
"A",
"B",
"C",
"D",
"E"
]
]
},
"else": "$drives",
}
},
}
}
])
You can combine this with a for loop in Python:
for std_doc in std_col.find({}, {
"ram": 1,
"storage": 1,
"manufacturers": 1,
"year": 1,
"drives": 1,
}):
print(list(list_col.aggregate([
{
"$project": {
"ram": {
"$cond": {
"if": {
"$gt": [
"$ram",
8
]
},
"then": None,
"else": "$ram",
}
},
"storage": {
"$cond": {
"if": {
"$and": [
{
"$gte": [
"$ram",
8
]
},
{
"$gte": [
"$storage",
2
]
},
],
},
"then": None,
"else": "$storage",
}
},
"manufacturers": {
"$cond": {
"if": {
"$in": [
"$manufacturers",
[
"Dell",
"Apple",
"Asus",
"Alienware"
],
]
},
"then": None,
"else": "$manufacturers",
}
},
"year": {
"$cond": {
"if": {
"$gte": [
"$year",
2018
]
},
"then": None,
"else": "$year",
}
},
"drives": {
"$cond": {
"if": {
"$gt": [
"$year",
2018
]
},
"then": {
"$setIntersection": [
"$drives",
[
"A",
"B",
"C",
"D",
"E"
]
]
},
"else": "$drives",
}
},
}
}
])))
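Note that the pipeline above hardcodes the thresholds rather than reading them from std_doc; if you want the loop to actually use the fetched standard values, one way (a sketch, my assumption about the intent) is to build each stage from the document:
def ram_stage(std_doc):
    # build the "ram" comparison from a fetched standard document
    return {
        "$cond": {
            "if": {"$gt": ["$ram", std_doc["ram"]]},
            "then": None,
            "else": "$ram",
        }
    }

# inside the loop: {"$project": {"ram": ram_stage(std_doc), ...}}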
The most efficient solution is to perform a $lookup, though this depends on your requirements:
db.std_col.aggregate([
{
"$lookup": {
"from": "dict_col",
"let": {
"cmpRam": "$ram",
"cmpStorage": "$storage",
"cmpManufacturers": "$manufacturers",
"cmpYear": "$year",
"cmpDrives": "$drives",
},
"pipeline": [
{
"$project": {
"ram": {
"$cond": {
"if": {
"$gt": [
"$ram",
"$$cmpRam",
]
},
"then": null,
"else": "$ram",
}
},
"storage": {
"$cond": {
"if": {
"$and": [
{
"$gte": [
"$ram",
"$$cmpRam"
]
},
{
"$gte": [
"$storage",
"$$cmpStorage"
]
},
],
},
"then": null,
"else": "$storage",
}
},
"manufacturers": {
"$cond": {
"if": {
"$in": [
"$manufacturers",
"$$cmpManufacturers",
]
},
"then": null,
"else": "$manufacturers",
}
},
"year": {
"$cond": {
"if": {
"$gte": [
"$year",
"$$cmpYear",
]
},
"then": null,
"else": "$year",
}
},
"drives": {
"$cond": {
"if": {
"$gt": [
"$year",
"$$cmpYear"
]
},
"then": {
"$setIntersection": [
"$drives",
"$$cmpDrives"
]
},
"else": "$drives",
}
},
}
},
],
"as": "inventory_docs"
}
}
])
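From Python, the same pipeline can be passed to pymongo essentially verbatim; a minimal sketch, assuming a database named testdb holding the std_col and dict_col collections (all names are placeholders), trimmed to the "ram" field for brevity:
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder URI
db = client["testdb"]                              # placeholder database name

# The full $project stage from the shell query drops in unchanged,
# with null written as None in Python.
lookup_pipeline = [{
    "$lookup": {
        "from": "dict_col",
        "let": {"cmpRam": "$ram"},
        "pipeline": [{
            "$project": {
                "ram": {
                    "$cond": {
                        "if": {"$gt": ["$ram", "$$cmpRam"]},
                        "then": None,
                        "else": "$ram",
                    }
                }
            }
        }],
        "as": "inventory_docs",
    }
}]

for doc in db["std_col"].aggregate(lookup_pipeline):
    print(doc["inventory_docs"])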
I have a JSON array like this:
{
"query": {
"bool": {
"must": [],
"should": [
{
"match": {
"Name": {
"query": "Nametest",
"fuzziness": 3,
"boost": 5
}
}
},
{
"match": {
"Address": {
"query": "NONE",
"fuzziness": 3,
"boost": 4
}
}
},
{
"match": {
"Site": {
"query": "Adeswfvfv",
"fuzziness": 3,
"boost": 4
}
}
},
{
"match": {
"Phone": {
"query": "5680728.00",
"fuzziness": 2,
"boost": 4
}
}
}
],
"minimum_should_match": 2
}
}
}
What I want to do: if an entry in json['query']['bool']['should'] has "query" equal to "NONE", I want to remove that entry from the array, so the new JSON becomes:
{
"query": {
"bool": {
"must": [],
"should": [
{
"match": {
"Name": {
"query": "Nametest",
"fuzziness": 3,
"boost": 5
}
}
},
{
"match": {
"Site": {
"query": "Adeswfvfv",
"fuzziness": 3,
"boost": 4
}
}
},
{
"match": {
"Phone": {
"query": "5680728.00",
"fuzziness": 2,
"boost": 4
}
}
}
],
"minimum_should_match": 2
}
}
}
I have tried iterating over the JSON and using del and pop on the entries, but nothing seems to help. I tried the Python json library but failed:
for e in q['query']['bool']['should']:
    if "NONE" in str(e['match']):
        del(e)
This should help. (Your del(e) attempt fails because del only unbinds the local name e; it never removes the element from the list.)
import pprint
d = {'query': {'bool': {'minimum_should_match': 2, 'should': [{'match': {'Name': {'query': 'Nametest', 'boost': 5, 'fuzziness': 3}}}, {'match': {'Address': {'query': 'NONE', 'boost': 4, 'fuzziness': 3}}}, {'match': {'Site': {'query': 'Adeswfvfv', 'boost': 4, 'fuzziness': 3}}}, {'match': {'Phone': {'query': '5680728.00', 'boost': 4, 'fuzziness': 2}}}], 'must': []}}}
d["query"]['bool']['should'] = [i for i in d["query"]['bool']['should'] if list(i['match'].items())[0][1]["query"] != 'NONE']
pprint.pprint(d)
Output:
{'query': {'bool': {'minimum_should_match': 2,
'must': [],
'should': [{'match': {'Name': {'boost': 5,
'fuzziness': 3,
'query': 'Nametest'}}},
{'match': {'Site': {'boost': 4,
'fuzziness': 3,
'query': 'Adeswfvfv'}}},
{'match': {'Phone': {'boost': 4,
'fuzziness': 2,
'query': '5680728.00'}}}]}}}
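If some entries might lack a 'match' clause or a 'query' field (an assumption on my part; the sample data always has both), a slightly more defensive variant of the same comprehension:
def keep(entry):
    # keep entries whose single match clause does not query "NONE"
    match = entry.get('match', {})
    field_body = next(iter(match.values()), {})
    return field_body.get('query') != 'NONE'

d["query"]['bool']['should'] = [e for e in d["query"]['bool']['should'] if keep(e)]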
I wrote this, but it seems complex:
for p,c in enumerate(json['query']['bool']['should']):
if list(c["match"].values())[0]["query"] == "NONE":
json['query']['bool']['should'].pop(p)
print(json)
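A note on why the enumerate/pop approach is fragile (my addition, not from the original posts): pop() shifts the remaining elements left while the loop index keeps advancing, so the element immediately after each removed one is skipped; two adjacent "NONE" entries would leave the second intact. Iterating over a shallow copy avoids this:
should = json['query']['bool']['should']  # keeping the poster's variable name
for c in list(should):  # iterate over a copy so removals don't shift the loop
    if list(c["match"].values())[0]["query"] == "NONE":
        should.remove(c)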
I have exhaustively reviewed and attempted implementations from all the other questions on SO corresponding to this challenge, and have yet to reach a solution.
Question: how do I convert employee and supervisor pairs into a hierarchical JSON structure to be used for a D3 visualization? There are an unknown number of levels, so it has to be dynamic.
I have a dataframe with five columns (yes, I realize this isn't the actual hierarchy of The Office):
Employee_FN Employee_LN Supervisor_FN Supervisor_LN Level
0 Michael Scott None None 0
1 Jim Halpert Michael Scott 1
2 Dwight Schrute Michael Scott 1
3 Stanley Hudson Jim Halpert 2
4 Pam Beasley Jim Halpert 2
5 Ryan Howard Pam Beasley 3
6 Kelly Kapoor Ryan Howard 4
7 Meredith Palmer Ryan Howard 4
Desired Output Snapshot:
{
"Employee_FN": "Michael",
"Employee_LN": "Scott",
"Level": "0",
"Reports": [{
"Employee_FN": "Jim",
"Employee_LN": "Halpert",
"Level": "1",
"Reports": [{
"Employee_FN": "Stanley",
"Employee_LN": "Hudson",
"Level": "2",
}, {
"Employee_FN": "Pam",
"Employee_LN": "Beasley",
"Level": "2",
}]
}]
}
Current State:
j = (df.groupby(['Level','Employee_FN','Employee_LN'], as_index=False)
.apply(lambda x: x[['Level','Employee_FN','Employee_LN']].to_dict('r'))
.reset_index()
.rename(columns={0:'Reports'})
.to_json(orient='records'))
print(json.dumps(json.loads(j), indent=2, sort_keys=True))
Current Output:
[
{
"Employee_FN": "Michael",
"Employee_LN": "Scott",
"Level": 0,
"Reports": [
{
"Employee_FN": "Michael",
"Employee_LN": "Scott",
"Level": 0
}
]
},
{
"Employee_FN": "Dwight",
"Employee_LN": "Schrute",
"Level": 1,
"Reports": [
{
"Employee_FN": "Dwight",
"Employee_LN": "Schrute",
"Level": 1
}
]
},
{
"Employee_FN": "Jim",
"Employee_LN": "Halpert",
"Level": 1,
"Reports": [
{
"Employee_FN": "Jim",
"Employee_LN": "Halpert",
"Level": 1
}
]
},
{
"Employee_FN": "Pam",
"Employee_LN": "Beasley",
"Level": 2,
"Reports": [
{
"Employee_FN": "Pam",
"Employee_LN": "Beasley",
"Level": 2
}
]
},
{
"Employee_FN": "Stanley",
"Employee_LN": "Hudson",
"Level": 2,
"Reports": [
{
"Employee_FN": "Stanley",
"Employee_LN": "Hudson",
"Level": 2
}
]
},
{
"Employee_FN": "Ryan",
"Employee_LN": "Howard",
"Level": 3,
"Reports": [
{
"Employee_FN": "Ryan",
"Employee_LN": "Howard",
"Level": 3
}
]
},
{
"Employee_FN": "Kelly",
"Employee_LN": "Kapoor",
"Level": 4,
"Reports": [
{
"Employee_FN": "Kelly",
"Employee_LN": "Kapoor",
"Level": 4
}
]
},
{
"Employee_FN": "Meredith",
"Employee_LN": "Palmer",
"Level": 4,
"Reports": [
{
"Employee_FN": "Meredith",
"Employee_LN": "Palmer",
"Level": 4
}
]
}
]
Problems:
Each person only has themselves as children
The whole JSON structure is wrapped in a list - I believe it has to be enclosed by {} to be readable
I have tried switching around the groupby and lambda elements in various configurations to reach the desired output as well. Any and all insight would be greatly appreciated! Thank you!
Update:
I changed my code block to this:
j = (df.groupby(['Level','Supervisor_FN','Supervisor_LN'], as_index=False)
.apply(lambda x: x[['Level','Employee_FN','Employee_LN']].to_dict('r'))
.reset_index()
.rename(columns={0:'Reports'})
.rename(columns={'Supervisor_FN':'Employee_FN'})
.rename(columns={'Supervisor_LN':'Employee_LN'})
.to_json(orient='records'))
print(json.dumps(json.loads(j), indent=2, sort_keys=True))
The new output is this:
[
{
"Employee_FN": "Michael",
"Employee_LN": "Scott",
"Level": 1,
"Reports": [
{
"Employee_FN": "Jim",
"Employee_LN": "Halpert",
"Level": 1
},
{
"Employee_FN": "Dwight",
"Employee_LN": "Schrute",
"Level": 1
}
]
},
{
"Employee_FN": "Jim",
"Employee_LN": "Halpert",
"Level": 2,
"Reports": [
{
"Employee_FN": "Stanley",
"Employee_LN": "Hudson",
"Level": 2
},
{
"Employee_FN": "Pam",
"Employee_LN": "Beasley",
"Level": 2
}
]
},
{
"Employee_FN": "Pam",
"Employee_LN": "Beasley",
"Level": 3,
"Reports": [
{
"Employee_FN": "Ryan",
"Employee_LN": "Howard",
"Level": 3
}
]
},
{
"Employee_FN": "Ryan",
"Employee_LN": "Howard",
"Level": 4,
"Reports": [
{
"Employee_FN": "Kelly",
"Employee_LN": "Kapoor",
"Level": 4
},
{
"Employee_FN": "Meredith",
"Employee_LN": "Palmer",
"Level": 4
}
]
}
]
Problems:
The Level matches the underlying employee for both the underlying employee and the supervisor
The nesting only goes one level deep
This type of problem isn't particularly well-suited for Pandas; the data structure you're going after is recursive, not tabular.
Here is one possible solution.
from operator import itemgetter
employee_key = itemgetter('Employee_FN', 'Employee_LN')
supervisor_key = itemgetter('Supervisor_FN', 'Supervisor_LN')
def subset(dict_, keys):
return {k: dict_[k] for k in keys}
# store employee references
cache = {}
# iterate over employees sorted by level, so supervisors are cached before reports
for row in df.sort_values('Level').to_dict('records'):
# look up employee/supervisor references
employee = cache.setdefault(employee_key(row), subset(row, keys=('Employee_FN', 'Employee_LN', 'Level')))
supervisor = cache.get(supervisor_key(row), {})
# add this employee to the supervisor's reports
supervisor.setdefault('Reports', []).append(employee)
# grab only top-level employees
[rec for key, rec in cache.items() if rec['Level'] == 0]
[{'Employee_FN': 'Michael',
'Employee_LN': 'Scott',
'Level': 0,
'Reports': [{'Employee_FN': 'Jim',
'Employee_LN': 'Halpert',
'Level': 1,
'Reports': [{'Employee_FN': 'Stanley',
'Employee_LN': 'Hudson',
'Level': 2},
{'Employee_FN': 'Pam',
'Employee_LN': 'Beasley',
'Level': 2,
'Reports': [{'Employee_FN': 'Ryan',
'Employee_LN': 'Howard',
'Level': 3,
'Reports': [{'Employee_FN': 'Kelly',
'Employee_LN': 'Kapoor',
'Level': 4},
{'Employee_FN': 'Meredith',
'Employee_LN': 'Palmer',
'Level': 4}]}]}]},
{'Employee_FN': 'Dwight', 'Employee_LN': 'Schrute', 'Level': 1}]}]
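To hand this tree to D3, you would typically serialize the top-level record; a small follow-up sketch (org_chart.json is a hypothetical file name):
import json

roots = [rec for rec in cache.values() if rec['Level'] == 0]
with open('org_chart.json', 'w') as f:  # hypothetical output path
    json.dump(roots[0], f, indent=2)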