MongoDB aggregation and group by id and then date - python

My audit_records collection looks like this:
{u'policy_holder': u'Kapil', u'_id': ObjectId('4d663451d1e7242c4b68e000'), u'audit_time': datetime.datetime(2015, 9, 6, 10, 5, 12, 474000), u'policy_ids': [u'92b7bbfa-688e9e5304d5'], u'category': u'TIManagement'}
{u'policy_holder': u'Sunil', u'_id': ObjectId('4d6634514cb5cb2c4b69e000'), u'audit_time': datetime.datetime(2015, 9, 6, 11, 5, 12, 474000), u'policy_ids': [u'92b7bbfa-688e9e5304d5'], u'category': u'PIManagement'}
{u'policy_holder': u'Edward', u'_id': ObjectId('4d6634514cb5cb2c4b65e000'), u'audit_time': datetime.datetime(2015, 8, 3, 12, 4, 2, 723000), u'policy_ids': [u'92b7ccge-688e9e5304d5'], u'category': u'TIManagement'}
I'm querying the database with an aggregation pipeline to group by policy_ids and count the policy_holder entries associated with each policy id.
My code is as below:
import datetime
startdate = datetime.datetime.strptime("2015-01-06", '%Y-%m-%d')
enddate = datetime.datetime.strptime("2015-10-01", '%Y-%m-%d')
pipe = [
    {'$match': {"audit_time": {"$gt": startdate, "$lte": enddate}}},
    {'$group': {'_id': '$policy_ids', 'policy_holder': {'$sum': 1}}},
]
for data in db.audit_records.aggregate(pipeline=pipe):
    print(data)
The output I got:
{u'policy_holder': 2, u'_id': u'92b7bbfa-688e9e5304d5'}
{u'policy_holder': 1, u'_id': u'92b7ccge-688e9e5304d5'}
Now I want to group this whole output by date as well. Is that possible, and how?

You have to use the aggregation pipeline with $unwind followed by $group:
db.collection.aggregate([{$unwind:"$policy_ids"},{$group:{_id:{policy_id:"$policy_ids",audit_time:"$audit_time"},sum:{$sum:1}}}])
I modified your documents a bit and inserted them like this (note that months in the JavaScript Date constructor are zero-based, so month 9 is October):
{'policy_holder': 'Kapil', '_id': ObjectId('4d663451d1e7242c4b68e000'), 'audit_time': new Date(2015, 9, 6, 10, 5, 12, 474000), 'policy_ids': ['92b7bbfa-688e9e5304d5'], 'category': 'TIManagement'}
{'policy_holder': 'Sunil', '_id': ObjectId('4d6634514cb5cb2c4b69e000'), 'audit_time': new Date(2015, 9, 6, 11, 5, 12, 474000), 'policy_ids': ['92b7bbfa-688e9e5304d5'], 'category': 'PIManagement'}
{'policy_holder': 'Edward', '_id': ObjectId('4d6634514cb5cb2c4b65e000'), 'audit_time': new Date(2015, 8, 3, 12, 4, 2, 723000), 'policy_ids': ['92b7ccge-688e9e5304d5'], 'category': 'TIManagement'}
Updated aggregation query:
db.policy.aggregate([{$unwind:"$policy_ids"},{$group:{_id:{"policy":"$policy_ids",day: { $dayOfYear: "$audit_time"}, year: { $year: "$audit_time" }},total:{$sum:1}}}])
Output:
{ "_id" : { "policy" : "92b7ccge-688e9e5304d5", "day" : 246, "year" : 2015 }, "total" : 1 }
{ "_id" : { "policy" : "92b7bbfa-688e9e5304d5", "day" : 279, "year" : 2015 }, "total" : 2 }
Hope this is what you were expecting.
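Since the question uses Python, here is a minimal sketch of the same idea in pymongo-style syntax. The `pipe` list combines the `$match` from the question with the `$unwind` and `$group` stages from the answer. The `group_locally` helper is a hypothetical name, not part of pymongo; it only mirrors the three stages in plain Python so the grouping can be checked without a server. Because Python's datetime months are 1-based while the shell's Date constructor months are 0-based, the day-of-year values here (249 and 215) differ from the 279 and 246 in the shell output above.

```python
import datetime
from collections import Counter

# Sample documents from the question (ObjectIds and unrelated fields omitted).
audit_records = [
    {"policy_holder": "Kapil",
     "audit_time": datetime.datetime(2015, 9, 6, 10, 5, 12, 474000),
     "policy_ids": ["92b7bbfa-688e9e5304d5"]},
    {"policy_holder": "Sunil",
     "audit_time": datetime.datetime(2015, 9, 6, 11, 5, 12, 474000),
     "policy_ids": ["92b7bbfa-688e9e5304d5"]},
    {"policy_holder": "Edward",
     "audit_time": datetime.datetime(2015, 8, 3, 12, 4, 2, 723000),
     "policy_ids": ["92b7ccge-688e9e5304d5"]},
]

startdate = datetime.datetime(2015, 1, 6)
enddate = datetime.datetime(2015, 10, 1)

# Pipeline as it could be passed to db.audit_records.aggregate(pipe).
pipe = [
    {"$match": {"audit_time": {"$gt": startdate, "$lte": enddate}}},
    {"$unwind": "$policy_ids"},
    {"$group": {
        "_id": {
            "policy": "$policy_ids",
            "day": {"$dayOfYear": "$audit_time"},
            "year": {"$year": "$audit_time"},
        },
        "total": {"$sum": 1},
    }},
]


def group_locally(docs, start, end):
    """Plain-Python mirror of the pipeline above (illustration only)."""
    counts = Counter()
    for doc in docs:
        when = doc["audit_time"]
        if not (start < when <= end):          # $match
            continue
        for policy in doc["policy_ids"]:       # $unwind
            key = (policy, when.year, when.timetuple().tm_yday)  # $group _id
            counts[key] += 1                   # $sum: 1
    return counts


local_totals = group_locally(audit_records, startdate, enddate)
for (policy, year, day), total in sorted(local_totals.items()):
    print(policy, year, day, total)
```

With a live connection, `db.audit_records.aggregate(pipe)` should yield the same totals, assuming `db` is the pymongo database handle used in the question.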

Related

Issues transforming tuple to denormalized dataframe

I have a tuple which is a list of 200 dicts:
eg:
mytuple= ([{'reviewId': '1234', 'userName': 'XXX', 'userImage': 'imagelink', 'content': 'AAA', 'score': 1, 'thumbsUpCount': 1, 'reviewCreatedVersion': '3.31.0', 'at': datetime.datetime(2022, 12, 1, 11, 49, 34), 'replyContent': "replycontent", 'repliedAt': datetime.datetime(2022, 12, 1, 12, 19, 51)},
{'reviewId': '5678', 'userName': 'S L', 'userImage': 'imagelink2', 'content': "content2", 'score': 1, 'thumbsUpCount': 0, 'reviewCreatedVersion': '3.31.0', 'at': datetime.datetime(2022, 11, 29, 12, 27, 46), 'replyContent': "replycontent2", 'repliedAt': datetime.datetime(2022, 11, 29, 12, 30, 40)}])
Ideally, I'd like to transform this into a dataframe with the following column headers:
reviewId  userName  userImage
1234      XXXX      imagelink
5678      S L       imagelink2
and so on with the column headers as the key and the columns containing the values.
mytuple was initially of size 2, from which I removed the second index and brought it down to just a list of dicts.
I tried different possibilities which include:
df=pd.DataFrame(mytuple)
df=pd.DataFrame.from_dict(mytuple)
df=pd.json_normalize(mytuple)
However, in all these cases, I get a dataframe as below
1                 2                 3    4
{'reviewId':..}   {'reviewId':..}   {}   {}
I'd like to understand where I'm going wrong. Thanks in advance!

Can the tuples be changed in rtc.datetime()?

import network, ntptime, time
from machine import RTC
# dictionary that maps string date names to indexes in the RTC's datetime tuple
DATETIME_ELEMENTS = {
    "year": 0,
    "month": 1,
    "day": 2,
    "day_of_week": 3,
    "hour": 4,
    "minute": 5,
    "second": 6,
    "millisecond": 7
}
def connect_to_wifi(wlan, ssid, password):
    if not wlan.isconnected():
        print("Connecting to network...")
        wlan.connect(ssid, password)
        while not wlan.isconnected():
            pass
# set an element of the RTC's datetime to a different value
def set_datetime_element(rtc, datetime_element, value):
    date = list(rtc.datetime())
    date[DATETIME_ELEMENTS[datetime_element]] = value
    rtc.datetime(date)
wlan = network.WLAN(network.STA_IF)
wlan.active(True)
connect_to_wifi(wlan, "SSID", "Password")
rtc = RTC()
ntptime.settime()
set_datetime_element(rtc, "hour", 8) # I call this to change the hour to 8am for me
print(rtc.datetime()) # print the updated RTC time
Prints results:
(2022, 4, 28, 3, 18, 50, 27, 0)
(2022, 4, 28, 3, 8, 50, 27, 0)
I'm trying to get:
(2022, 4, 28, 8, 50, 27)
I don't want the day or microseconds. Any suggestions?
If you only want to print a subset of the fields in the tuple, you can use Python's slicing operation to select only those fields:
>>> now=(2022, 4, 28, 3, 18, 50, 27, 0)
>>> print(now)
(2022, 4, 28, 3, 18, 50, 27, 0)
>>> print(now[:3] + now[4:7])
(2022, 4, 28, 18, 50, 27)
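The same slicing idea can be wrapped in a small helper that keeps the hour (index 4) and drops the day-of-week (index 3) and sub-second (index 7) fields. This is only a sketch: `compact_datetime` is a hypothetical name, and it assumes the 8-field layout described by DATETIME_ELEMENTS in the question.

```python
def compact_datetime(dt):
    """Return (year, month, day, hour, minute, second) from an 8-field RTC tuple."""
    # Assumed layout: (year, month, day, day_of_week, hour, minute, second, subsecond)
    return tuple(dt[:3]) + tuple(dt[4:7])


# Second tuple printed in the question, after the hour was set to 8.
now = (2022, 4, 28, 3, 8, 50, 27, 0)
print(compact_datetime(now))  # (2022, 4, 28, 8, 50, 27)
```

On the board this could be called as `compact_datetime(rtc.datetime())`.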

Filter an array of datetime given the start and end date in pymongo

I'm having a problem filtering an array of dates using "$gte" and "$lte" in pymongo. Here is a piece of code to illustrate the problem.
import datetime
from pymongo import MongoClient
client = MongoClient('mongodb://localhost:27017')
db = client["AirQuality"]
demo = db["demo"]
demo.save({
"devId": 1,
"samples": [
{"value":3, "datetime":datetime.datetime(2021, 3, 4, 20, 15, 22)},
{"value":6, "datetime":datetime.datetime(2021, 3, 4, 22, 35, 12)},
{"value":2, "datetime":datetime.datetime(2021, 3, 6, 10, 15, 00)}
]
})
I would like to filter the values for a particular range:
start = datetime.datetime(2021, 3, 4, 22, 00, 00)
end = datetime.datetime(2021, 3, 5, 2, 26, 49)
list(demo.find( { 'samples.datetime': { "$gte":start, "$lte":end } } ))
the output is as follows:
[{'_id': ObjectId('604353efad253df2602dfaf9'), 'devId': 1, 'samples': [{'value': 3, 'datetime': datetime.datetime(2021, 3, 4, 20, 15, 22)}, {'value': 6, 'datetime': datetime.datetime(2021, 3, 4, 22, 35, 12)}, {'value': 2, 'datetime': datetime.datetime(2021, 3, 6, 10, 15)}]}]
but I expect:
[{'_id': ObjectId('604353efad253df2602dfaf9'), 'devId': 1, 'samples': [{'value': 6, 'datetime': datetime.datetime(2021, 3, 4, 22, 35, 12)}]}]
Where am I going wrong? Even if I apply a filter on "value" it doesn't work, so I believe the error is in the query! Thanks! 🙏
Solved with aggregation:
result = demo.aggregate([
    {
        "$project": {
            "samples": {
                "$filter": {
                    "input": "$samples",
                    "as": "item",
                    "cond": {
                        "$and": [
                            {"$gte": ["$$item.datetime", start]},
                            {"$lte": ["$$item.datetime", end]}
                        ]
                    }
                }
            }
        }
    }
])
list(result)
which returns:
[{'_id': ObjectId('604353efad253df2602dfaf9'), 'samples': [{'value': 6, 'datetime': datetime.datetime(2021, 3, 4, 22, 35, 12)}]}]
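To see what the `$filter` stage is doing, the same condition can be applied to the sample document in plain Python. This is only an illustration of the logic, not a replacement for the server-side query; `filter_samples` is a hypothetical helper name.

```python
import datetime

# The document saved in the question (without the generated _id).
doc = {
    "devId": 1,
    "samples": [
        {"value": 3, "datetime": datetime.datetime(2021, 3, 4, 20, 15, 22)},
        {"value": 6, "datetime": datetime.datetime(2021, 3, 4, 22, 35, 12)},
        {"value": 2, "datetime": datetime.datetime(2021, 3, 6, 10, 15, 0)},
    ],
}

start = datetime.datetime(2021, 3, 4, 22, 0, 0)
end = datetime.datetime(2021, 3, 5, 2, 26, 49)


def filter_samples(document, start, end):
    """Keep only the array elements whose datetime lies in [start, end]."""
    kept = [s for s in document["samples"] if start <= s["datetime"] <= end]
    # Return a copy so the original document is left untouched.
    return {**document, "samples": kept}


filtered = filter_samples(doc, start, end)
print(filtered["samples"])
```

The original `find` query only decides whether the whole document matches, which is why all three samples came back; the per-element test has to happen in `$filter` (or, as here, in application code).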
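Use find like this; I think it will solve your problem:
list(demo.find({"$and": [{'samples.datetime': {"$gte": start}}, {'samples.datetime': {"$lte": end}}]}))
Note that find returns whole matching documents, so the samples array itself is not trimmed by this query.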

Convert JSON to Excel by Python

I have a JSON document that needs to be converted to Excel.
I'm using Python 3.8 with xlsxwriter library.
Below is sample JSON.
{
"companyId": "123456",
"companyName": "Test",
"companyStatus": "ACTIVE",
"document": {
"employee": {
"employeeId": "EM1567",
"employeeLastName": "Test Last",
"employeeFirstName": "Test Fist"
},
"expenseEntry": [
{
"allocation": [
{
"allocationId": "03B249B3598",
"journal": [
{
"journalAccountCode": "888",
"journalPayee": "EMPL",
"journalPayer": "COMP",
"taxGuid": [
"51645A638114E"
]
},
{
"journalAccountCode": "999",
"journalPayee": "EMPL",
"journalPayer": "EMPL",
"taxGuid": [
"8114E51645A63"
]
}
],
"tax": [
{
"taxCode": "TAX123",
"taxSource": "SYST"
},
{
"taxCode": "TAX456",
"taxSource": "SYST"
}
]
}
],
"approvedAmount": 200.0,
"entryDate": "2020-12-10",
"entryId": "ENTRY9988"
}
],
"report": {
"currencyCode": "USD",
"reportCreationDate": "2020-12-10",
"reportId": "ACA849BBB",
"reportName": "Test Report",
"totalApprovedAmount": 200.0
}
},
"id": "c71b7d756f549"
}
And my current code:
https://repl.it/#tonyiscoming/jsontoexcel
I tried with pandas
import pandas as pd
df = pd.json_normalize(data, max_level=5)
df.to_excel('test.xlsx', index=False)
And got the result
I tried with json_excel_converter
from json_excel_converter import Converter
from json_excel_converter.xlsx import Writer
conv = Converter()
conv.convert(data, Writer(file='test.xlsx'))
And got the result
This is my expectation
Would anyone please help me in this case? Thank you so much.
Here is the code you are looking for. I did this using the XlsxWriter package. First I built the template with some cell formatting. After that, I filled in the values according to your JSON.
import xlsxwriter
from itertools import zip_longest
data = [
{
"companyId": "123456",
"companyName": "Test",
"companyStatus": "ACTIVE",
"document": {
"employee": {
"employeeId": "EM1567",
"employeeLastName": "Test Last",
"employeeFirstName": "Test Fist"
},
"expenseEntry": [
{
"allocation": [
{
"allocationId": "03B249B3598",
"journal": [
{
"journalAccountCode": "888",
"journalPayee": "EMPL",
"journalPayer": "COMP",
"taxGuid": [
"51645A638114E"
]
},
{
"journalAccountCode": "999",
"journalPayee": "EMPL",
"journalPayer": "EMPL",
"taxGuid": [
"8114E51645A63"
]
},
],
"tax": [
{
"taxCode": "TAX123",
"taxSource": "SYST"
},
{
"taxCode": "TAX456",
"taxSource": "SYST"
}
]
}
],
"approvedAmount": 200.0,
"entryDate": "2020-12-10",
"entryId": "ENTRY9988"
}
],
"report": {
"currencyCode": "USD",
"reportCreationDate": "2020-12-10",
"reportId": "ACA849BBB",
"reportName": "Test Report",
"totalApprovedAmount": 200.0
}
},
"id": "c71b7d756f549"
}
]
xlsx_file = 'your_file_name_here.xlsx'
# define the excel file
workbook = xlsxwriter.Workbook(xlsx_file)
# create a sheet for our work, defaults to Sheet1.
worksheet = workbook.add_worksheet()
# common merge format
merge_format = workbook.add_format({'align': 'center', 'valign': 'vcenter'})
# set all column width to 20
worksheet.set_column('A:V', 20)
# column wise template creation (A-V)
worksheet.merge_range(0, 0, 4, 0, 'companyId', merge_format) # A
worksheet.merge_range(0, 1, 4, 1, 'companyName', merge_format) # B
worksheet.merge_range(0, 2, 4, 2, 'companyStatus', merge_format) # C
worksheet.merge_range(0, 3, 0, 20, 'document', merge_format) # D-U
worksheet.merge_range(1, 3, 1, 5, 'employee', merge_format) # D-F
worksheet.merge_range(2, 3, 4, 3, 'employeeId', merge_format) # D
worksheet.merge_range(2, 4, 4, 4, 'employeeLastName', merge_format) # E
worksheet.merge_range(2, 5, 4, 5, 'employeeFirstName', merge_format) # F
worksheet.merge_range(1, 6, 1, 15, 'expenseEntry', merge_format) # G-P
worksheet.merge_range(2, 6, 2, 12, 'allocation', merge_format) # G-M
worksheet.merge_range(3, 6, 4, 6, 'allocationId', merge_format) # G
worksheet.merge_range(3, 7, 3, 10, 'journal', merge_format) # H-K
worksheet.write(4, 7, 'journalAccountCode') # H
worksheet.write(4, 8, 'journalPayee') # I
worksheet.write(4, 9, 'journalPayer') # J
worksheet.write(4, 10, 'taxGuid') # K
worksheet.merge_range(3, 11, 3, 12, 'tax', merge_format) # L-M
worksheet.write(4, 11, 'taxCode') # L
worksheet.write(4, 12, 'taxSource') # M
worksheet.merge_range(2, 13, 4, 13, 'approvedAmount', merge_format) # N
worksheet.merge_range(2, 14, 4, 14, 'entryDate', merge_format) # O
worksheet.merge_range(2, 15, 4, 15, 'entryId', merge_format) # P
worksheet.merge_range(1, 16, 1, 20, 'report', merge_format) # Q-U
worksheet.merge_range(2, 16, 4, 16, 'currencyCode', merge_format) # Q
worksheet.merge_range(2, 17, 4, 17, 'reportCreationDate', merge_format) # R
worksheet.merge_range(2, 18, 4, 18, 'reportId', merge_format) # S
worksheet.merge_range(2, 19, 4, 19, 'reportName', merge_format) # T
worksheet.merge_range(2, 20, 4, 20, 'totalApprovedAmount', merge_format) # U
worksheet.merge_range(0, 21, 4, 21, 'id', merge_format) # V
# inserting data
row = 5
for obj in data:
    worksheet.write(row, 0, obj.get('companyId'))
    worksheet.write(row, 1, obj.get('companyName'))
    worksheet.write(row, 2, obj.get('companyStatus'))
    document = obj.get('document', {})
    # employee details
    employee = document.get('employee', {})
    worksheet.write(row, 3, employee.get('employeeId'))
    worksheet.write(row, 4, employee.get('employeeLastName'))
    worksheet.write(row, 5, employee.get('employeeFirstName'))
    # report details
    report = document.get('report', {})
    worksheet.write(row, 16, report.get('currencyCode'))
    worksheet.write(row, 17, report.get('reportCreationDate'))
    worksheet.write(row, 18, report.get('reportId'))
    worksheet.write(row, 19, report.get('reportName'))
    worksheet.write(row, 20, report.get('totalApprovedAmount'))
    worksheet.write(row, 21, obj.get('id'))
    # expenseEntry details
    expense_entries = document.get('expenseEntry', [])
    for expense_entry in expense_entries:
        worksheet.write(row, 13, expense_entry.get('approvedAmount'))
        worksheet.write(row, 14, expense_entry.get('entryDate'))
        worksheet.write(row, 15, expense_entry.get('entryId'))
        # allocation details
        allocations = expense_entry.get('allocation', [])
        for allocation in allocations:
            worksheet.write(row, 6, allocation.get('allocationId'))
            # journal and tax details
            journals = allocation.get('journal', [])
            taxes = allocation.get('tax', [])
            # fillvalue={} keeps .get() safe when one list is shorter than the other
            for journal, tax in zip_longest(journals, taxes, fillvalue={}):
                worksheet.write(row, 7, journal.get('journalAccountCode'))
                worksheet.write(row, 8, journal.get('journalPayee'))
                worksheet.write(row, 9, journal.get('journalPayer'))
                worksheet.write(row, 11, tax.get('taxCode'))
                worksheet.write(row, 12, tax.get('taxSource'))
                # taxGuid details
                tax_guides = journal.get('taxGuid', [])
                if not tax_guides:
                    row = row + 1
                    continue
                for tax_guide in tax_guides:
                    worksheet.write(row, 10, tax_guide)
                    row = row + 1
# finally close the created excel file
workbook.close()
One more thing: instead of creating the template in the script, you can build your own template workbook and save it somewhere. Then take a copy of that template and just add the data with the script. That lets you design the base template by hand; otherwise you have to do all the Excel formatting in the script, such as borders, merged cells, etc.
I used the zip_longest function from Python's itertools module to zip the journal and tax objects. See the articles "Python – Itertools.zip_longest()" or "Python's zip_longest Function" for examples. If anything in my code is unclear, please comment below.
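A short sketch of how zip_longest behaves when the two lists differ in length. By default it pads the shorter input with None, so calling .get() on a padded entry would fail; passing fillvalue={} avoids that. The third journal entry ("777") is made up for illustration and is not part of the sample JSON.

```python
from itertools import zip_longest

journals = [
    {"journalAccountCode": "888"},
    {"journalAccountCode": "999"},
    {"journalAccountCode": "777"},  # hypothetical extra entry
]
taxes = [{"taxCode": "TAX123"}, {"taxCode": "TAX456"}]

# Default behaviour: the missing tax entry is padded with None.
default_pairs = list(zip_longest(journals, taxes))

# With fillvalue={} every element is a dict, so .get() is always safe.
safe_pairs = list(zip_longest(journals, taxes, fillvalue={}))
rows = [(j.get("journalAccountCode"), t.get("taxCode")) for j, t in safe_pairs]
print(rows)
```

Writing None to a cell leaves it blank, which matches the empty cells in the expected layout.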
Having empty cells in an Excel grid is not really "proper", which is why json_excel_converter behaves like this.
So, if you want to achieve this, I'm afraid you'll have to develop it all yourself.

How to retrieve the top-5 largest values (integers) from a dictionary of key-value pairs?

I have created 3 dictionaries from a function that iterates through a dataframe of popular iOS apps. The 3 dictionaries contain key-value pairs counting how often each key occurs in the dataframe. From these dictionaries, I want to retrieve the 5 largest values of each dictionary as well as the corresponding keys. These are the results from the dataframe iterations. Obviously I can see this by eye, but I want Python to determine the 5 largest.
Prices: {0.0: 415, 4.99: 10, 2.99: 13, 0.99: 31, 1.99: 13, 9.99: 1, 3.99: 2, 6.99: 3}
Genres: {'Productivity': 9, 'Shopping': 12, 'Reference': 3, 'Finance': 6, 'Music': 19, 'Games': 308, 'Travel': 6, 'Sports': 7, 'Health & Fitness': 8, 'Food & Drink': 4, 'Entertainment': 19, 'Photo & Video': 25, 'Social Networking': 21, 'Business': 4, 'Lifestyle': 4, 'Weather': 8, 'Navigation': 2, 'Book': 4, 'News': 2, 'Utilities': 12, 'Education': 5}
Content Ratings: {'4+': 304, '12+': 100, '9+': 54, '17+': 30}
You can sort the dictionaries by values and then slice the top 5:
sorted(Prices, key=Prices.get, reverse=True)[:5]
Same for the other two dicts.
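The one-liner above returns only the keys. If the values are needed as well, one possible sketch uses heapq.nlargest on the (key, value) pairs; `top_n` is a hypothetical helper name, and the dictionaries are copied from the question.

```python
import heapq
from operator import itemgetter

prices = {0.0: 415, 4.99: 10, 2.99: 13, 0.99: 31, 1.99: 13, 9.99: 1, 3.99: 2, 6.99: 3}
content_ratings = {'4+': 304, '12+': 100, '9+': 54, '17+': 30}


def top_n(counts, n=5):
    """Return the n (key, value) pairs with the largest values, largest first."""
    return heapq.nlargest(n, counts.items(), key=itemgetter(1))


top_prices = top_n(prices)
top_ratings = top_n(content_ratings)
print(top_prices)
print(top_ratings)
```

When a dictionary has fewer than five entries, as with the content ratings, all of its pairs are returned in descending order of value.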
You can also accomplish this using itemgetter:
from operator import itemgetter
prices = {0.0: 415, 4.99: 10, 2.99: 13, 0.99: 31, 1.99: 13, 9.99: 1, 3.99: 2, 6.99: 3}
genres = {'Productivity': 9, 'Shopping': 12, 'Reference': 3, 'Finance': 6, 'Music': 19, 'Games': 308, 'Travel': 6, 'Sports': 7, 'Health & Fitness': 8, 'Food & Drink': 4, 'Entertainment': 19, 'Photo & Video': 25, 'Social Networking': 21, 'Business': 4, 'Lifestyle': 4, 'Weather': 8, 'Navigation': 2, 'Book': 4, 'News': 2, 'Utilities': 12, 'Education': 5}
contentRatings = {'4+': 304, '12+': 100, '9+': 54, '17+': 30}
arr = [prices, contentRatings, genres]
for test_dict in arr:
    # printing original dictionary
    print("The original dictionary is : " + str(test_dict))
    # 5 largest values in dictionary
    # Using sorted() + itemgetter() + items()
    res = dict(sorted(test_dict.items(), key=itemgetter(1), reverse=True)[:5])
    # printing result
    print("The top 5 value pairs are " + str(res))
