AWS Glue: How to expand nested Hive struct to Dict?

I'm trying to expand the field mappings of a table created by my AWS Glue crawler into a nested dictionary in Python, but I can't find any Spark/Hive parser that deserializes the string
var_type = 'struct<loc_lat:double,service_handler:string,ip_address:string,device:bigint,source:struct<id:string,contacts:struct<admin:struct<email:string,name:string>>,name:string>,loc_name:string>'
which is located in table_schema['Table']['StorageDescriptor']['Columns'], into a Python dict.
How to dump the table definition in Glue:
import boto3
client = boto3.client('glue')
client.get_table(DatabaseName=selected_db, Name=selected_table)
Response:
table_schema = {'Table': {'Name': 'asdfasdf',
'DatabaseName': 'asdfasdf',
'Owner': 'owner',
'CreateTime': datetime.datetime(2019, 7, 29, 13, 20, 13, tzinfo=tzlocal()),
'UpdateTime': datetime.datetime(2019, 7, 29, 13, 20, 13, tzinfo=tzlocal()),
'LastAccessTime': datetime.datetime(2019, 7, 29, 13, 20, 13, tzinfo=tzlocal()),
'Retention': 0,
'StorageDescriptor': {'Columns': [{'Name': 'version', 'Type': 'int'},
{'Name': 'payload',
'Type': 'struct<loc_lat:double,service_handler:string,ip_address:string,device:bigint,source:struct<id:string,contacts:struct<admin:struct<email:string,name:string>>,name:string>,loc_name:string>'},
{'Name': 'origin', 'Type': 'string'}],
'Location': 's3://asdfasdf/',
'InputFormat': 'org.apache.hadoop.mapred.TextInputFormat',
'OutputFormat': 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat',
'Compressed': False,
'NumberOfBuckets': -1,
'SerdeInfo': {'SerializationLibrary': 'org.openx.data.jsonserde.JsonSerDe',
'Parameters': {'paths': 'origin,payload,version'}},
'BucketColumns': [],
'SortColumns': [],
'Parameters': {'CrawlerSchemaDeserializerVersion': '1.0',
'CrawlerSchemaSerializerVersion': '1.0',
'UPDATED_BY_CRAWLER': 'asdfasdf',
'averageRecordSize': '799',
'classification': 'json',
'compressionType': 'none',
'objectCount': '94',
'recordCount': '92171',
'sizeKey': '74221058',
'typeOfData': 'file'},
'StoredAsSubDirectories': False},
'PartitionKeys': [{'Name': 'partition_0', 'Type': 'string'},
{'Name': 'partition_1', 'Type': 'string'},
{'Name': 'partition_2', 'Type': 'string'}],
'TableType': 'EXTERNAL_TABLE',
'Parameters': {'CrawlerSchemaDeserializerVersion': '1.0',
'CrawlerSchemaSerializerVersion': '1.0',
'UPDATED_BY_CRAWLER': 'asdfasdf',
'averageRecordSize': '799',
'classification': 'json',
'compressionType': 'none',
'objectCount': '94',
'recordCount': '92171',
'sizeKey': '74221058',
'typeOfData': 'file'},
'CreatedBy': 'arn:aws:sts::asdfasdf'},
'ResponseMetadata': {'RequestId': 'asdfasdf',
'HTTPStatusCode': 200,
'HTTPHeaders': {'date': 'Thu, 01 Aug 2019 16:23:06 GMT',
'content-type': 'application/x-amz-json-1.1',
'content-length': '3471',
'connection': 'keep-alive',
'x-amzn-requestid': 'asdfasdf'},
'RetryAttempts': 0}}
The goal is a Python dictionary with a value for each field type, instead of the embedded string. E.g.
expand_function('struct<loc_lat:double,service_handler:string,ip_address:string,device:bigint,source:struct<id:string,contacts:struct<admin:struct<email:string,name:string>>,name:string>,loc_name:string>')
returns
{
    'loc_lat': 'double',
    'service_handler': 'string',
    'ip_address': 'string',
    'device': 'bigint',
    'source': {
        'id': 'string',
        'contacts': {
            'admin': {
                'email': 'string',
                'name': 'string'
            }
        },
        'name': 'string'
    },
    'loc_name': 'string'
}
Thanks!

The accepted answer doesn't handle arrays.
This solution does:
import json
import re
def _hive_struct_to_json(hive_str):
    """
    Expands embedded Hive struct strings to Python dictionaries
    Args:
        Hive struct format as string
    Returns
        JSON object
    """
    r = re.compile(r'(.*?)(struct<|array<|[:,>])(.*)')
    root = dict()
    to_parse = hive_str
    parents = []
    curr_elem = root
    key = None
    while to_parse:
        left, operator, to_parse = r.match(to_parse).groups()
        if operator == 'struct<' or operator == 'array<':
            parents.append(curr_elem)
            new_elem = dict() if operator == 'struct<' else list()
            if key:
                curr_elem[key] = new_elem
                curr_elem = new_elem
            elif isinstance(curr_elem, list):
                curr_elem.append(new_elem)
                curr_elem = new_elem
            key = None
        elif operator == ':':
            key = left
        elif operator == ',' or operator == '>':
            if left:
                if isinstance(curr_elem, dict):
                    curr_elem[key] = left
                elif isinstance(curr_elem, list):
                    curr_elem.append(left)
            if operator == '>':
                curr_elem = parents.pop()
    return root
hive_str = '''
struct<
loc_lat:double,
service_handler:string,
ip_address:string,
device:bigint,
source:struct<
id:string,
contacts:struct<
admin:struct<
email:string,
name:array<string>
>
>,
name:string
>,
loc_name:string,
tags:array<
struct<
key:string,
value:string
>
>
>
'''
hive_str = re.sub(r'[\s]+', '', hive_str).strip()
print(hive_str)
print(json.dumps(_hive_struct_to_json(hive_str), indent=2))
Prints:
struct<loc_lat:double,service_handler:string,ip_address:string,device:bigint,source:struct<id:string,contacts:struct<admin:struct<email:string,name:array<string>>>,name:string>,loc_name:string,tags:array<struct<key:string,value:string>>>
{
"loc_lat": "double",
"service_handler": "string",
"ip_address": "string",
"device": "bigint",
"source": {
"id": "string",
"contacts": {
"admin": {
"email": "string",
"name": [
"string"
]
}
},
"name": "string"
},
"loc_name": "string",
"tags": [
{
"key": "string",
"value": "string"
}
]
}
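The parser above covers struct and array. As a rough sketch of how the same idea could be extended, here is a small recursive-descent variant that also accepts map<k,v> and parameterised primitives such as decimal(10,2). The function name parse_hive_type and the {'map_key': ..., 'map_value': ...} representation of maps are assumptions made for this example, not part of Glue or Hive, and malformed input is not handled gracefully.

```python
def parse_hive_type(type_str):
    """Parse a Hive type string into nested dicts, lists and strings."""
    text = "".join(type_str.split())  # drop all whitespace
    pos = 0

    def parse_type():
        nonlocal pos
        if text.startswith("struct<", pos):
            pos += len("struct<")
            fields = {}
            while True:
                colon = text.index(":", pos)
                name = text[pos:colon]
                pos = colon + 1
                fields[name] = parse_type()
                if text[pos] == ",":
                    pos += 1  # another field follows
                    continue
                pos += 1  # consume the closing '>'
                return fields
        if text.startswith("array<", pos):
            pos += len("array<")
            element = parse_type()
            pos += 1  # consume the closing '>'
            return [element]
        if text.startswith("map<", pos):
            pos += len("map<")
            key_type = parse_type()
            pos += 1  # consume the ',' between key and value types
            value_type = parse_type()
            pos += 1  # consume the closing '>'
            # Hypothetical representation chosen for this sketch
            return {"map_key": key_type, "map_value": value_type}
        # Primitive type: read up to the next ',' or '>' outside parentheses
        start, depth = pos, 0
        while pos < len(text):
            ch = text[pos]
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
            elif ch in ",>" and depth == 0:
                break
            pos += 1
        return text[start:pos]

    result = parse_type()
    if pos != len(text):
        raise ValueError(f"unexpected trailing text at position {pos}")
    return result


# Column list shaped like table_schema['Table']['StorageDescriptor']['Columns']
columns = [
    {"Name": "version", "Type": "int"},
    {"Name": "payload",
     "Type": "struct<a:int,b:array<struct<k:string,v:decimal(10,2)>>,c:map<string,bigint>>"},
]
expanded = {col["Name"]: parse_hive_type(col["Type"]) for col in columns}
print(expanded)
```

Applying the function per column, as in the last lines, gives one nested structure for each entry in the Glue column list.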

Here's a function that converts the embedded Hive struct string above. It relies on plain string replacement, so it only handles nested structs, not array types.
import json
def _hive_struct_to_json(hive_struct):
    """
    Expands embedded Hive struct strings to Python dictionaries
    Args:
        Hive struct format as string
    Returns
        JSON object
    """
    # Convert embedded hive type definition string to JSON
    hive_struct = hive_struct.replace(':', '":"')
    hive_struct = hive_struct.replace(',', '","')
    hive_struct = hive_struct.replace('struct<', '{"')
    hive_struct = hive_struct.replace('"{"', '{"')
    hive_struct = hive_struct.replace('>', '"}')
    hive_struct = hive_struct.replace('}"', '}')
    return json.loads(hive_struct)
hive_str = 'struct<loc_lat:double,service_handler:string,ip_address:string,device:bigint,source:struct<id:string,contacts:struct<admin:struct<email:string,name:string>>,name:string>,loc_name:string>'
print(json.dumps(_hive_struct_to_json(hive_str),indent=2))
Returns:
{
"loc_lat": "double",
"service_handler": "string",
"ip_address": "string",
"device": "bigint",
"source": {
"id": "string",
"contacts": {
"admin": {
"email": "string",
"name": "string"
}
},
"name": "string"
},
"loc_name": "string"
}

I looked for existing ways to do this and found some helper functions in pyspark.
import pyspark.sql.types as T
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("tmp").getOrCreate()
struct_map = T._parse_datatype_string("MAP < STRING, STRUCT < year: INT, place: STRING, details: STRING >>")
struct_map is a pyspark type that in turn has nested fields to iterate over. Once you have an object like the one above, flattening it with a recursive call should be easy. Note that _parse_datatype_string has a leading underscore, which marks it as a private helper, so it may change between pyspark versions. I am open to hearing opinions from others about this approach.

Related

Dealing with scope in a recursive Python function

I have a JSON input file that looks like this:
{"nodes": [
{"properties": {
"id": "rootNode",
"name": "Bertina Dunmore"},
"nodes": [
{"properties": {
"id": 1,
"name": "Gwenneth Rylett",
"parent_id": "rootNode"},
"nodes": [
{"properties": {
"id": 11,
"name": "Joell Waye",
"parent_id": 1}},
{"properties": {
"id": 12,
"name": "Stan Willcox",
"parent_id": 1}}]},
{"properties": {
"id": 2,
"name": "Delbert Dukesbury",
"parent_id": "rootNode"},
"nodes": [
{"properties": {
"id": 21,
"name": "Cecil McKeever",
"parent_id": 2}},
{"properties": {
"id": 22,
"name": "Joy Obee",
"parent_id": 2}}]}]}]}
I want to get the nested properties dictionaries into a (flat) list of dictionaries. Creating a recursive function that reads these dictionaries is easy:
def get_node(nodes):
    for node in nodes:
        print(node['properties'])
        if 'nodes' in node.keys():
            get_node(node['nodes'])
Now, I'm struggling to append these to a single list:
def get_node(nodes):
    prop_list = []
    for node in nodes:
        print(node['properties'])
        prop_list.append(node['properties'])
        if 'nodes' in node.keys():
            get_node(node['nodes'])
    return prop_list
This returns [{'id': 'rootNode', 'name': 'Bertina Dunmore'}], even though all properties dictionaries are printed. I suspect that this is because I'm not handling the function scope properly.
Can someone please help me get my head around this?

Your problem is that every time you call get_node, the list you append to is initialized again. You can avoid this by passing the list into the recursive function.
Moreover, I think it would be nice to use a dataclass for this problem:
from dataclasses import dataclass
from typing import Union
@dataclass
class Property:
    id: Union[int, str]
    name: str
    parent_id: Union[int, str, None] = None
def explore_json(data, properties: list = None):
    if properties is None:
        properties = []
    for key, val in data.items():
        if key == "nodes":
            for node in val:
                explore_json(node, properties)
        elif key == "properties":
            properties.append(Property(**val))
    return properties
explore_json(data)
Output:
[Property(id='rootNode', name='Bertina Dunmore', parent_id=None),
Property(id=1, name='Gwenneth Rylett', parent_id='rootNode'),
Property(id=11, name='Joell Waye', parent_id=1),
Property(id=12, name='Stan Willcox', parent_id=1),
Property(id=2, name='Delbert Dukesbury', parent_id='rootNode'),
Property(id=21, name='Cecil McKeever', parent_id=2),
Property(id=22, name='Joy Obee', parent_id=2)]

You need to combine the prop_list returned by the recursive call with the prop_list in the current scope. For example,
def get_node(nodes):
    prop_list = []
    for node in nodes:
        print(node['properties'])
        prop_list.append(node['properties'])
        if 'nodes' in node.keys():
            prop_list.extend(get_node(node['nodes']))
    return prop_list

With this version, which takes the list as an argument:
def get_node(prop_list, nodes):
    for node in nodes:
        print(node['properties'])
        prop_list.append(node['properties'])
        if 'nodes' in node.keys():
            get_node(prop_list, node['nodes'])
You can just do:
prop_list = []
get_node(prop_list, <yourdictnodes>)
This should fill prop_list with:
{'id': 'rootNode', 'name': 'Bertina Dunmore'}
{'id': 1, 'name': 'Gwenneth Rylett', 'parent_id': 'rootNode'}
{'id': 11, 'name': 'Joell Waye', 'parent_id': 1}
{'id': 12, 'name': 'Stan Willcox', 'parent_id': 1}
{'id': 2, 'name': 'Delbert Dukesbury', 'parent_id': 'rootNode'}
{'id': 21, 'name': 'Cecil McKeever', 'parent_id': 2}
{'id': 22, 'name': 'Joy Obee', 'parent_id': 2}
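The answers above either pass one shared list down the recursion or merge the lists returned by each call. A third option, sketched here, is a generator that yields each properties dict and delegates to the recursive call with yield from, so no list has to be threaded through at all. The name iter_properties and the trimmed sample data are made up for this example.

```python
def iter_properties(nodes):
    """Yield every 'properties' dict in depth-first order."""
    for node in nodes:
        yield node["properties"]
        # .get() covers leaf nodes that have no 'nodes' key
        yield from iter_properties(node.get("nodes", []))


# Trimmed stand-in for the JSON input from the question
data = {"nodes": [
    {"properties": {"id": "rootNode"},
     "nodes": [
         {"properties": {"id": 1, "parent_id": "rootNode"},
          "nodes": [{"properties": {"id": 11, "parent_id": 1}}]},
         {"properties": {"id": 2, "parent_id": "rootNode"}},
     ]},
]}

prop_list = list(iter_properties(data["nodes"]))
print([p["id"] for p in prop_list])
```

Because each call only yields, there is no per-call list to lose, which is the scope issue the question ran into.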

parse weird yaml file uploaded to server with python

I have a config server where we read the service config from.
In there we have a yaml file that I need to read but it has a weird format on the server looking like:
{
"document[0].Name": "os",
"document[0].Rules.Rule1": false,
"document[0].Rules.Rule2": true,
"document[0].MinScore": 100,
"document[0].MaxScore": 100,
"document[0].ClusterId": 22,
"document[0].Enabled": true,
"document[0].Module": "device",
"document[0].Description": "",
"document[0].Modified": 1577880000000,
"document[0].Created": 1577880000000,
"document[0].RequiredReview": false,
"document[0].Type": "NO_CODE",
"document[1].Name": "rule with params test",
"document[1].Rules.Rule": false,
"document[1].MinScore": 100,
"document[1].MaxScore": 100,
"document[1].ClusterId": 29,
"document[1].Enabled": true,
"document[1].Module": "device",
"document[1].Description": "rule with params test",
"document[1].Modified": 1577880000000,
"document[1].Created": 1577880000000,
"document[1].RequiredReview": false,
"document[1].Type": "NO_CODE",
"document[1].ParametersRules[0].Features.feature1.op": ">",
"document[1].ParametersRules[0].Features.feature1.value": 10,
"document[1].ParametersRules[0].Features.feature2.op": "==",
"document[1].ParametersRules[0].Features.feature2.value": true,
"document[1].ParametersRules[0].Features.feature3.op": "range",
"document[1].ParametersRules[0].Features.feature3.value[0]": 4,
"document[1].ParametersRules[0].Features.feature3.value[1]": 10,
"document[1].ParametersRules[0].Features.feature4.op": "!=",
"document[1].ParametersRules[0].Features.feature4.value": "None",
"document[1].ParametersRules[0].DecisionType": "all",
"document[1].ParametersRules[1].Features.feature5.op": "<",
"document[1].ParametersRules[1].Features.feature5.value": 1000,
"document[1].ParametersRules[1].DecisionType": "any"
}
and this is how the dict supposed to look like (might not be perfect I did it by hand):
[
{
"Name": "os",
"Rules": { "Rule1": false, "Rule2": true },
"MinScore": 100,
"MaxScore": 100,
"ClusterId": 22,
"Enabled": true,
"Module": "device",
"Description": "",
"Modified": 1577880000000,
"Created": 1577880000000,
"RequiredReview": false,
"Type": "NO_CODE"
},
{
"Name": "rule with params test",
"Rules": { "Rule": false},
"MinScore": 100,
"MaxScore": 100,
"ClusterId": 29,
"Enabled": true,
"Module": "device",
"Description": "rule with params test",
"Modified": 1577880000000,
"Created": 1577880000000,
"RequiredReview": false,
"Type": "NO_CODE",
"ParametersRules":[
{"Features": {"feature1": {"op": ">", "value": 10},
"feature2": {"op": "==", "value": true},
"feature3": {"op": "range", "value": [4,10]},
"feature4": {"op": "!=", "value": "None"}} ,
"DecisionType": "all"},
{"Features": { "feature5": { "op": "<", "value": 1000 }},
"DecisionType": "any"}
]
}
]
I don't have a way to change how the file is uploaded to the server (it's a different team and quite the headache) so I need to parse it using python.
My thought is that someone probably encountered it before so there must be a package that solves it, and I hoped that someone here might know.
Thanks.

I have a sample; I hope it helps you.
import yaml
import os
file_dir = os.path.dirname(os.path.abspath(__file__))
with open(f"{file_dir}/file.json") as json_file:
    config = yaml.full_load(json_file)
with open(f'{file_dir}/meta.yaml', 'w+') as yaml_file:
    yaml.dump(config, yaml_file, allow_unicode=True)  # this writes the loaded JSON data out as YAML
Your current output is:
- ClusterId: 22
  Created: 1577880000000
  Description: ''
  Enabled: true
  MaxScore: 100
  MinScore: 100
  Modified: 1577880000000
  Module: device
  Name: os
  RequiredReview: false
  Rules:
    Rule1: false
    Rule2: true
  Type: NO_CODE
- ClusterId: 29
  Created: 1577880000000
  Description: rule with params test
  Enabled: true
  MaxScore: 100
  MinScore: 100
  Modified: 1577880000000
  Module: device
  Name: rule with params test
  ParametersRules:
  - DecisionType: all
    Features:
      feature1:
        op: '>'
        value: 10
      feature2:
        op: ==
        value: true
      feature3:
        op: range
        value:
        - 4
        - 10
      feature4:
        op: '!='
        value: None
  - DecisionType: any
    Features:
      feature5:
        op: <
        value: 1000
  RequiredReview: false
  Rules:
    Rule: false
  Type: NO_CODE

Here is my approach so far. It's far from perfect, but hope it gives you an idea of how to tackle it.
from __future__ import annotations  # can be removed in Python 3.10+
def clean_value(o: str | bool | int) -> str | bool | int | None:
    """handle int, None, or bool values encoded as a string"""
    if isinstance(o, str):
        lowercase = o.lower()
        if lowercase.isnumeric():
            return int(o)
        elif lowercase == 'none':
            return None
        elif lowercase in ('true', 'false'):
            return lowercase == 'true'
            # return eval(o.capitalize())
    return o
# noinspection PyUnboundLocalVariable
def process(o: dict):
    # final return list
    docs_list = []
    doc: dict[str, list | dict | str | bool | int | None]
    doc_idx: int
    def add_new_doc(new_idx: int):
        """Push new item to result list, and increment index."""
        nonlocal doc_idx, doc
        doc_idx = new_idx
        doc = {}
        docs_list.append(doc)
    # add initial `dict` object to return list
    add_new_doc(0)
    for k, v in o.items():
        doc_id, key, *parts = k.split('.')
        doc_id: str
        key: str
        parts: list[str]
        curr_doc_idx = int(doc_id.rsplit('[', 1)[1].rstrip(']'))
        if curr_doc_idx > doc_idx:
            add_new_doc(curr_doc_idx)
        if not parts:
            final_val = clean_value(v)
        elif key in doc:
            # For example, when we encounter `document[0].Rules.Rule2`, but we've already encountered
            # `document[0].Rules.Rule1` - so in this case, we add value to the existing dict.
            final_val = temp_dict = doc[key]
            temp_dict: dict
            for p in parts[:-1]:
                temp_dict = temp_dict.setdefault(p, {})
            temp_dict[parts[-1]] = clean_value(v)
        else:
            final_val = temp_dict = {}
            for p in parts[:-1]:
                temp_dict = temp_dict[p] = {}
            temp_dict[parts[-1]] = clean_value(v)
        doc[key] = final_val
    return docs_list
if __name__ == '__main__':
    import json
    from pprint import pprint
    j = """{
"document[0].Name": "os",
"document[0].Rules.Rule1": false,
"document[0].Rules.Rule2": "true",
"document[0].MinScore": 100,
"document[0].MaxScore": 100,
"document[0].ClusterId": 22,
"document[0].Enabled": true,
"document[0].Module": "device",
"document[0].Description": "",
"document[0].Modified": 1577880000000,
"document[0].Created": 1577880000000,
"document[0].RequiredReview": false,
"document[0].Type": "NO_CODE",
"document[1].Name": "rule with params test",
"document[1].Rules.Rule": false,
"document[1].MinScore": 100,
"document[1].MaxScore": 100,
"document[1].ClusterId": 29,
"document[1].Enabled": true,
"document[1].Module": "device",
"document[1].Description": "rule with params test",
"document[1].Modified": 1577880000000,
"document[1].Created": 1577880000000,
"document[1].RequiredReview": false,
"document[1].Type": "NO_CODE",
"document[1].ParametersRules[0].Features.feature1.op": ">",
"document[1].ParametersRules[0].Features.feature1.value": 10,
"document[1].ParametersRules[0].Features.feature2.op": "==",
"document[1].ParametersRules[0].Features.feature2.value": true,
"document[1].ParametersRules[0].Features.feature3.op": "range",
"document[1].ParametersRules[0].Features.feature3.value[0]": 4,
"document[1].ParametersRules[0].Features.feature3.value[1]": 10,
"document[1].ParametersRules[0].Features.feature4.op": "!=",
"document[1].ParametersRules[0].Features.feature4.value": "None",
"document[1].ParametersRules[0].DecisionType": "all",
"document[1].ParametersRules[1].Features.feature5.op": "<",
"document[1].ParametersRules[1].Features.feature5.value": 1000,
"document[1].ParametersRules[1].DecisionType": "any"
}"""
    d: dict[str, str | bool | int | None] = json.loads(j)
    result = process(d)
    pprint(result)
Result:
[{'ClusterId': 22,
'Created': 1577880000000,
'Description': '',
'Enabled': True,
'MaxScore': 100,
'MinScore': 100,
'Modified': 1577880000000,
'Module': 'device',
'Name': 'os',
'RequiredReview': False,
'Rules': {'Rule1': False, 'Rule2': True},
'Type': 'NO_CODE'},
{'ClusterId': 29,
'Created': 1577880000000,
'Description': 'rule with params test',
'Enabled': True,
'MaxScore': 100,
'MinScore': 100,
'Modified': 1577880000000,
'Module': 'device',
'Name': 'rule with params test',
'ParametersRules[0]': {'DecisionType': 'all',
'Features': {'feature1': {'value': 10},
'feature2': {'op': '==', 'value': True},
'feature3': {'op': 'range',
'value[0]': 4,
'value[1]': 10},
'feature4': {'op': '!=', 'value': None}}},
'ParametersRules[1]': {'DecisionType': 'any',
'Features': {'feature5': {'value': 1000}}},
'RequiredReview': False,
'Rules': {'Rule': False},
'Type': 'NO_CODE'}]
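Of course, one of the problems is that it doesn't account for indexed paths like document[1].ParametersRules[0].Features.feature1.op, which should ideally create a new sub-list to add values to. The result above also drops the op entries for feature1 and feature5: in the else branch, the chained assignment temp_dict = temp_dict[p] = {} rebinds temp_dict before indexing it, so the first deeply nested value for a new key ends up in a dict that is never attached to the document.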

How to get specific json value - python

I am migrating my code from java to python, but I am still having some difficulties understanding how to fetch a specific path in json using python.
This is my Java code, which returns a list of accountsId.
public static List < String > v02_JSON_counterparties(String date) {
baseURI = "https://cdwp/cdw";
String counterparties =
given()
.auth().basic(getJiraUser(), getJiraPass())
.param("limit", "1000000")
.param("count", "false")
.when()
.get("/counterparties/" + date).body().asString();
List < String > accountId = extract_accountId(counterparties);
return accountId;
}
public static List < String > extract_accountId(String res) {
List < String > ids = JsonPath.read(res, "$..identifier[?(@.accountIdType == 'ACCOUNTID')].accountId");
return ids;
}
And this is the json structure where I am getting the accountID.
{
'organisationId': {
'#value': 'MHI'
},
'accountName': 'LAZARD AM DEUT AC LF1632',
'identifiers': {
'accountId': 'LAZDLF1632',
'customerId': 'LAZAMDEUSG',
'blockAccountCode': 'LAZDEUBDBL',
'bic': 'LAMDDEF1XXX',
'identifier': [{
'accountId': 'MHI',
'accountIdType': 'REVNCNTR'
}, {
'accountId': 'LAZDLF1632',
'accountIdType': 'ACCOUNTID'
}, {
'accountId': 'LAZAMDEUSG',
'accountIdType': 'MHICUSTID'
}, {
'accountId': 'LAZDEUBDBL',
'accountIdType': 'BLOCKACCOUNT'
}, {
'accountId': 'LAMDDEF1XXX',
'accountIdType': 'ACCOUNTBIC'
}, {
'accountId': 'LAZDLF1632',
'accountIdType': 'GLOBEOP'
}]
},
'isBlocAccount': 'N',
'accountStatus': 'COMPLETE',
'products': {
'productType': [{
'productLineName': 'CASH',
'productTypeId': 'PRODMHI1',
'productTypeName': 'Bond, Equity,Convertible Bond',
'cleared': 'N',
'bilateral': 'N',
'limitInstructions': {
'limitInstruction': [{
'limitAmount': '0',
'limitCurrency': 'GBP',
'limitType': 'PEAEXPLI',
'limitTypeName': 'Cash-Peak Exposure Limit'
}]
}
}]
},
'etc': {
'addressGeneral': 'LZFLUS33XXX',
'addressAccount': 'LF1632',
'tradingLevel': 'B'
},
'clientBroker': 'C',
'costCentre': 'Credit Sales',
'clientLevel': 'SUBAC',
'accountCreationDate': '2016-10-19T00:00:00.000Z',
'accountOpeningDate': '2016-10-19T00:00:00.000Z'
}
This is my code in Python
import json, requests, urllib.parse, re
from pandas.io.parsers import read_csv
import pandas as pd
from termcolor import colored
import numpy as np
from glob import glob
import os
# Set Up
dateinplay = "2021-09-27"
#Get accountId
cdwCounterparties = (
f"http://cdwu/cdw/counterparties/?limit=1000000?yyyy-mm-dd={dateinplay}"
)
r = json.loads(requests.get(cdwCounterparties).text)
account_ids = [i['accountId'] for i in data['identifiers']['identifier']if i['accountIdType']=="ACCOUNTID"]
I am getting this error when I try to fetch the accountId:
Traceback (most recent call last):
File "h:\DESKTOP\test_check\checkCounterpartie.py", line 54, in <module>
account_ids = [i['accountId'] for i in data['identifiers']['identifier']if i['accountIdType']=="ACCOUNTID"]
TypeError: list indices must be integers or slices, not str

If I'm interpreting your question correctly, you want all IDs where accountIdType is "ACCOUNTID".
This gives you that:
account_ids = [i['accountId'] for i in data['identifiers']['identifier']if i['accountIdType']=="ACCOUNTID"]

accs = {
"identifiers": {
...
account_id_list = []
for acc in accs.get("identifiers", {}).get("identifier", []):
    account_id_list.append(acc.get("accountId", ""))
This creates a list called account_id_list, which is
['MHI', 'DKEPBNPGIV', 'DKEPLLP SG', 'DAVKEMEQBL', '401821', 'DKEPGB21XXX', 'DKEPBNPGIV', 'DKPARTNR']

assuming you store the dictionary (json structure) in variable x, getting all accountIDs is something like:
account_ids = [i['accountId'] for i in x['identifiers']['identifier']]

I'd like to thank you all for your answers. They helped me a lot in finding a solution to my problem.
Below is how I solved it.
import requests
from jsonpath_ng import parse  # assumption: parse() is the one from the jsonpath-ng package
listAccountId = []
cdwCounterparties = (
    f"http://cdwu/cdw/counterparties/?limit=100000?yyyy-mm-dd={dateinplay}"
)
r = requests.get(cdwCounterparties).json()
jsonpath_expression = parse("$..accounts.account[*].identifiers.identifier[*]")
for match in jsonpath_expression.find(r):
    # print(f'match id: {match.value}')
    thisdict = match.value
    if thisdict["accountIdType"] == "ACCOUNTID":
        # print(thisdict["accountId"])
        listAccountId.append(thisdict["accountId"])
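If adding a JSONPath dependency is not an option, the same filter can be approximated with a small recursive walk over the decoded response. This is a sketch only: the function name find_account_ids is made up, the sample response is a trimmed guess at the shape implied by the accounts.account path above, and the second account is invented for illustration.

```python
def find_account_ids(obj, id_type="ACCOUNTID"):
    """Collect accountId values from every 'identifier' list, at any depth."""
    found = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            if key == "identifier" and isinstance(value, list):
                found.extend(
                    item["accountId"]
                    for item in value
                    if isinstance(item, dict)
                    and item.get("accountIdType") == id_type
                    and "accountId" in item
                )
            # Keep descending so nested accounts are found as well
            found.extend(find_account_ids(value, id_type))
    elif isinstance(obj, list):
        for item in obj:
            found.extend(find_account_ids(item, id_type))
    return found


# Hypothetical, trimmed response; a real one would come from requests.get(...).json()
response = {"accounts": {"account": [
    {"accountName": "LAZARD AM DEUT AC LF1632",
     "identifiers": {"identifier": [
         {"accountId": "MHI", "accountIdType": "REVNCNTR"},
         {"accountId": "LAZDLF1632", "accountIdType": "ACCOUNTID"},
     ]}},
    {"accountName": "second account (made up)",
     "identifiers": {"identifier": [
         {"accountId": "ACC2", "accountIdType": "ACCOUNTID"},
     ]}},
]}}

account_ids = find_account_ids(response)
print(account_ids)
```

The walk works whether the top level is a dict or a list, which sidesteps the "list indices must be integers" error from the question.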

Save dict of key and list of dict inside key to JSON where dictionary is stored by line

I have a similar question to this previous question. However, my dictionary has a structure like the following
data_dict = {
'refresh_count': 1,
'fetch_date': '10-10-2019',
'modified_date': '',
'data': [
{'date': '10-10-2019', 'title': 'Hello1'},
{'date': '11-10-2019', 'title': 'Hello2'}
]
}
I would like to store it in JSON so that my data is still stored in one dictionary per line. Something like:
{
'refresh_count': 1,
'fetch_date': '10-10-2019',
'modified_date': '',
'data': [
{'date': '10-10-2019', 'title': 'Hello1'},
{'date': '11-10-2019', 'title': 'Hello2'}
]
}
I cannot achieve this using json.dumps (or dump) alone, or with the previous solution.
json.dumps(data_dict, indent=2)
>> {
"refresh_count": 1,
"fetch_date": "10-10-2019",
"modified_date": "",
"data": [
{
"date": "10-10-2019",
"title": "Hello1"
},
{
"date": "11-10-2019",
"title": "Hello2"
}
]
}

This is quite a hack, but you can implement a custom JSON encoder that will do what you want (see Custom JSON Encoder in Python With Precomputed Literal JSON). For any object that you do not want to be indented, wrap it with the NoIndent class. The custom JSON encoder will look for this type in the default() method and return a unique string (__N__) and store unindented JSON in self._literal. Later, in the call to encode(), these unique strings are replaced with the unindented JSON.
Note that you need to choose a string format that cannot possibly appear in the encoded data to avoid replacing something unintentionally.
import json
class NoIndent:
    def __init__(self, o):
        self.o = o
class MyEncoder(json.JSONEncoder):
    def __init__(self, *args, **kwargs):
        super(MyEncoder, self).__init__(*args, **kwargs)
        self._literal = []
    def default(self, o):
        if isinstance(o, NoIndent):
            i = len(self._literal)
            self._literal.append(json.dumps(o.o))
            return '__%d__' % i
        else:
            return super(MyEncoder, self).default(o)
    def encode(self, o):
        s = super(MyEncoder, self).encode(o)
        for i, literal in enumerate(self._literal):
            s = s.replace('"__%d__"' % i, literal)
        return s
data_dict = {
    'refresh_count': 1,
    'fetch_date': '10-10-2019',
    'modified_date': '',
    'data': [
        NoIndent({'date': '10-10-2019', 'title': 'Hello1'}),
        NoIndent({'date': '11-10-2019', 'title': 'Hello2'}),
    ]
}
s = json.dumps(data_dict, indent=2, cls=MyEncoder)
print(s)
Intermediate representation returned by super(MyEncoder, self).encode(o):
{
"fetch_date": "10-10-2019",
"refresh_count": 1,
"data": [
"__0__",
"__1__"
],
"modified_date": ""
}
Final output:
{
"fetch_date": "10-10-2019",
"refresh_count": 1,
"data": [
{"date": "10-10-2019", "title": "Hello1"},
{"date": "11-10-2019", "title": "Hello2"}
],
"modified_date": ""
}
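When only one known top-level key holds the list of row dictionaries, the placeholder substitution can be skipped and the text assembled directly. The sketch below assumes exactly that layout; the function name dumps_rows_per_line and its list_key parameter are made up for this example, and every other value is written compactly on a single line.

```python
import json


def dumps_rows_per_line(obj, list_key="data", indent=2):
    """Pretty-print obj, keeping each element of obj[list_key] on one line."""
    pad = " " * indent
    parts = []
    for key, value in obj.items():
        if key == list_key:
            rows = ",\n".join(pad * 2 + json.dumps(row) for row in value)
            rendered = "[\n" + rows + "\n" + pad + "]" if value else "[]"
        else:
            rendered = json.dumps(value)
        parts.append(f"{pad}{json.dumps(key)}: {rendered}")
    return "{\n" + ",\n".join(parts) + "\n}"


data_dict = {
    "refresh_count": 1,
    "fetch_date": "10-10-2019",
    "modified_date": "",
    "data": [
        {"date": "10-10-2019", "title": "Hello1"},
        {"date": "11-10-2019", "title": "Hello2"},
    ],
}
s = dumps_rows_per_line(data_dict)
print(s)
```

Every piece is produced by json.dumps, so the result is still valid JSON and nothing in the data can collide with a placeholder string.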

Unable to append data to array

I am retrieving a record set from a database.
Then using a for statement I am trying to construct my data to match a 3rd party API.
But I get this error and can't figure it out:
"errorType": "TypeError", "errorMessage": "list indices must be integers, not str"
"messages['english']['merge_vars']['vars'].append({"
Below is my code:
cursor = connect_to_database()
records = get_records(cursor)
template = dict()
messages = dict()
template['english'] = "SOME_TEMPLATE reminder-to-user-english"
messages['english'] = {
'subject': "Reminder (#*|code|*)",
'from_email': 'mail@mail.com',
'from_name': 'Notifier',
'to': [],
'merge_vars': [],
'track_opens': True,
'track_clicks': True,
'important': True
}
for record in records:
    record = dict(record)
    if record['lang'] == 'english':
        messages['english']['to'].append({
            'email': record['email'],
            'type': 'to'
        })
        messages['english']['merge_vars'].append({
            'rcpt': record['email']
        })
        for (key, value) in record.iteritems():
            messages['english']['merge_vars']['vars'].append({
                'name': key,
                'content': value
            })
    else:
        template['other'] = "SOME_TEMPLATE reminder-to-user-other"
close_database_connection()
return messages
The goal is to get something like this below:
messages = {
'subject': "...",
'from_email': "...",
'from_name': "...",
'to': [
{
'email': '...',
'type': 'to',
},
{
'email': '...',
'type': 'to',
}
],
'merge_vars': [
{
'rcpt': '...',
'vars': [
{
'content': '...',
'name': '...'
},
{
'content': '...',
'name': '...'
}
]
},
{
'rcpt': '...',
'vars': [
{
'content': '...',
'name': '...'
},
{
'content': '...',
'name': '...'
}
]
}
]
}

This code seems to indicate that messages['english']['merge_vars'] is a list, since you initialize it as such:
messages['english'] = {
...
'merge_vars': [],
...
}
And call append on it:
messages['english']['merge_vars'].append({
'rcpt': record['email']
})
However later, you treat it as a dictionary when you call:
messages['english']['merge_vars']['vars']
It seems what you want is something more like:
vars = [{'name': key, 'content': value} for key, value in record.iteritems()]
messages['english']['merge_vars'].append({
    'rcpt': record['email'],
    'vars': vars,
})
Then the inner for loop over record.iteritems() is unnecessary.
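Put together, the suggestion above could look like the following self-contained sketch. It is written for Python 3, so it uses items() where the question uses iteritems(), and the function name build_message and the sample records are made up for this example.

```python
def build_message(records):
    """Build the 'to' and 'merge_vars' lists, one entry per record."""
    message = {"to": [], "merge_vars": []}
    for record in records:
        record = dict(record)
        message["to"].append({"email": record["email"], "type": "to"})
        # Each recipient gets its own dict with a nested 'vars' list
        message["merge_vars"].append({
            "rcpt": record["email"],
            "vars": [{"name": k, "content": v} for k, v in record.items()],
        })
    return message


# Made-up rows standing in for the database records
records = [
    {"email": "a@example.com", "code": "X1"},
    {"email": "b@example.com", "code": "Y2"},
]
message = build_message(records)
print(message["merge_vars"][0])
```

Here merge_vars stays a list throughout, and the string key 'vars' is only ever used on the per-recipient dict, which is what the error message was pointing at.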

The error is saying that you are trying to access a list element with a string instead of an integer index.
I believe your mistake is in this line:
messages['english']['merge_vars']['vars'].append({..})
You declared merge_vars as a list like so:
'merge_vars': []
So, either make it a dict like this:
'merge_vars': {}
Or use it as a list:
messages['english']['merge_vars'].append({..})
Hope it helps

Your issue, as the error message says, is here: messages['english']['merge_vars']['vars'].append({'name': key,'content': value})
The item messages['english']['merge_vars'] is a list, so indexing it works like list[i], and i cannot be a string such as 'vars'. You probably need to either drop the ['vars'] part or make messages['english']['merge_vars'] a dict so that it allows string keys.
