Parse all valid datetime strings in json recursively - python

I have a JSON blob of the following format. Is there a way to identify all strings which match the format
%Y-%m-%dT%H:%M:%S
and convert them to datetime objects?
{
  "data": [
    {
      "name": "Testing",
      "dob": "2001-01-01T01:00:30"
    },
    {
      "name": "Testing2",
      "dob": "2001-01-01T01:00:30",
      "licence_info": {
        "issue_date": "2020-01-01T01:00:30"
      }
    }
  ]
}

The easiest way to do this is to parse each value and attempt to convert it to a datetime. You could do something like this:
import json
from datetime import datetime

def convert_dates(value):
    if isinstance(value, dict):
        return {k: convert_dates(v) for k, v in value.items()}
    elif isinstance(value, list):
        return [convert_dates(v) for v in value]
    else:
        try:
            # strptime raises ValueError on non-matching strings and
            # TypeError on non-string values (numbers, booleans, None)
            return datetime.strptime(value, '%Y-%m-%dT%H:%M:%S')
        except (ValueError, TypeError):
            return value
jstr = '''
{
  "data": [
    {
      "name": "Testing",
      "dob": "2001-01-01T01:00:30"
    },
    {
      "name": "Testing2",
      "dob": "2001-01-01T01:00:30",
      "licence_info": {
        "issue_date": "2020-01-01T01:00:30"
      }
    }
  ]
}
'''
d = json.loads(jstr)
convert_dates(d)
Output:
{
  'data': [
    {
      'name': 'Testing',
      'dob': datetime.datetime(2001, 1, 1, 1, 0, 30)
    },
    {
      'name': 'Testing2',
      'dob': datetime.datetime(2001, 1, 1, 1, 0, 30),
      'licence_info': {'issue_date': datetime.datetime(2020, 1, 1, 1, 0, 30)}
    }
  ]
}
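An alternative sketch, not part of the answer above: json.loads accepts an object_hook that is called for every decoded JSON object, so the conversion can happen during parsing. Note the hook only sees dict values, so a datetime string sitting directly inside a list would be missed.
import json
from datetime import datetime

def date_hook(obj):
    # object_hook is called bottom-up for each decoded JSON object (dict)
    for k, v in obj.items():
        if isinstance(v, str):
            try:
                obj[k] = datetime.strptime(v, '%Y-%m-%dT%H:%M:%S')
            except ValueError:
                pass
    return obj

d = json.loads(jstr, object_hook=date_hook)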

Related

I want to change value in json array object based on index number using pyspark

I want to change a value in a JSON array object based on its index number using pyspark, and will then use columnName to update the dataframe column names:
input:
jsonArray = [
  {
    "index": 1,
    "columnName": "Names"
  },
  {
    "index": 2,
    "columnName": "City"
  }
]
output:
jsonArray = [
  {
    "index": 1,
    "columnName": "titles"
  },
  {
    "index": 2,
    "columnName": "countries"
  }
]
function header:
def renameColumn(index, newName, df):
    return df_with_new_column_names
If I understood your requirement correctly, try something as below-
for i in range(len(jsonArray)):
    if jsonArray[i]['index'] == 1:
        jsonArray[i]['columnName'] = "titles"
    else:
        jsonArray[i]['columnName'] = "countries"
print(jsonArray)
Output -
[{'index': 1, 'columnName': 'titles'}, {'index': 2, 'columnName': 'countries'}]
Or, as a reusable function:
def jsonColumnName(jsonArray, indx, newName):
    for jsonObj in jsonArray:
        if jsonObj['index'] == indx:
            jsonObj['columnName'] = newName
    return jsonArray
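For the pyspark part of the question (the renameColumn header), here is a minimal sketch, assuming index refers to the 1-based position of the column in the DataFrame (the question doesn't pin this down):
def renameColumn(index, newName, df):
    # df.columns is the ordered list of column names; rename the one at the
    # assumed 1-based position and return the new DataFrame
    old_name = df.columns[index - 1]
    return df.withColumnRenamed(old_name, newName)
The two pieces can then be combined, e.g. df = renameColumn(1, 'titles', df) after updating jsonArray.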

Turn dict with duplicate keys into list containing these keys

I receive a response I have no control over from an API. Using requests, response.json() will filter out duplicate keys, so I would need to turn this response into a list where each key is an element in that list. What I get now:
{
  "user": {
    //...
  },
  "user": {
    //...
  },
  //...
}
What I need:
{
  "users": [
    {
      "user": {
        //...
      }
    },
    {
      "user": {
        //...
      }
    },
    //...
  ]
}
This way JSON won't filter out any of the results, and I can loop through users.
Okay, let me have a try with the method used in Python json parser allow duplicate keys.
All we need to do is handle the pairs_list ourselves.
from json import JSONDecoder

def parse_object_pairs(pairs):
    return pairs

data = """
{"foo": {"key": 2, "key": 3}, "foo": 4, "foo": 23}
"""
decoder = JSONDecoder(object_pairs_hook=parse_object_pairs)
pairs_list = decoder.decode(data)
# the pairs_list is the real thing which we can use

aggre_key = 's'

def recursive_handle(pairs_list):
    dct = {}
    for k, v in pairs_list:
        if v and isinstance(v, list) and isinstance(v[0], tuple):
            v = recursive_handle(v)
        if k + aggre_key in dct:
            dct[k + aggre_key].append({k: v})
        elif k in dct:
            first_dict = {k: dct.pop(k)}
            dct[k + aggre_key] = [first_dict, {k: v}]
        else:
            dct[k] = v
    return dct

print(recursive_handle(pairs_list))
output:
{'foos': [{'foo': {'keys': [{'key': 2}, {'key': 3}]}}, {'foo': 4}, {'foo': 23}]}
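Since requests forwards keyword arguments from response.json() straight to json.loads, the same hook can be plugged into the API response directly; a sketch with a hypothetical endpoint:
import requests

resp = requests.get('https://api.example.com/users')  # hypothetical URL
pairs_list = resp.json(object_pairs_hook=parse_object_pairs)
print(recursive_handle(pairs_list))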

Python sort JSON based on value

I need to sort my JSON based on value in ascending/descending order in Python.
This is my JSON:
{
  "efg": 1,
  "mnp": 4,
  "xyz": 3
}
expected output is:
{
  "mnp": 4,
  "xyz": 3,
  "efg": 1
}
The above is just a sample JSON; the actual JSON is much bigger.
And how would I reverse sort it based on value?
{
  "efg": 1,
  "xyz": 3,
  "mnp": 4
}
Please help.
import json
from collections import OrderedDict
json_str = """
{
"efg": 1,
"mnp": 4,
"xyz": 3
}
"""
json_dict = json.loads(json_str)
dict_sorted = OrderedDict(sorted(json_dict.items(), key=lambda x: x[1]))
str_sorted = json.dumps(dict_sorted) # '{"efg": 1, "xyz": 3, "mnp": 4}'
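For the descending half of the question, sorted takes reverse=True; and on Python 3.7+ plain dicts preserve insertion order, so OrderedDict is optional there:
desc_sorted = dict(sorted(json_dict.items(), key=lambda x: x[1], reverse=True))
print(json.dumps(desc_sorted))  # {"mnp": 4, "xyz": 3, "efg": 1}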

AWS Glue: How to expand nested Hive struct to Dict?

I'm trying to expand the field mappings in a Table mapped by my AWS Glue crawler into a nested dictionary in Python. But I can't find any Spark/Hive parsers to deserialize the
var_type = 'struct<loc_lat:double,service_handler:string,ip_address:string,device:bigint,source:struct<id:string,contacts:struct<admin:struct<email:string,name:string>>,name:string>,loc_name:string>'
string located in table_schema['Table']['StorageDescriptor']['Columns'] to a Python dict.
How to dump the table definition in Glue:
import boto3
client = boto3.client('glue')
client.get_table(DatabaseName=selected_db, Name=selected_table)
Response:
table_schema = {
  'Table': {
    'Name': 'asdfasdf',
    'DatabaseName': 'asdfasdf',
    'Owner': 'owner',
    'CreateTime': datetime.datetime(2019, 7, 29, 13, 20, 13, tzinfo=tzlocal()),
    'UpdateTime': datetime.datetime(2019, 7, 29, 13, 20, 13, tzinfo=tzlocal()),
    'LastAccessTime': datetime.datetime(2019, 7, 29, 13, 20, 13, tzinfo=tzlocal()),
    'Retention': 0,
    'StorageDescriptor': {
      'Columns': [
        {'Name': 'version', 'Type': 'int'},
        {'Name': 'payload',
         'Type': 'struct<loc_lat:double,service_handler:string,ip_address:string,device:bigint,source:struct<id:string,contacts:struct<admin:struct<email:string,name:string>>,name:string>,loc_name:string>'},
        {'Name': 'origin', 'Type': 'string'}
      ],
      'Location': 's3://asdfasdf/',
      'InputFormat': 'org.apache.hadoop.mapred.TextInputFormat',
      'OutputFormat': 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat',
      'Compressed': False,
      'NumberOfBuckets': -1,
      'SerdeInfo': {
        'SerializationLibrary': 'org.openx.data.jsonserde.JsonSerDe',
        'Parameters': {'paths': 'origin,payload,version'}
      },
      'BucketColumns': [],
      'SortColumns': [],
      'Parameters': {
        'CrawlerSchemaDeserializerVersion': '1.0',
        'CrawlerSchemaSerializerVersion': '1.0',
        'UPDATED_BY_CRAWLER': 'asdfasdf',
        'averageRecordSize': '799',
        'classification': 'json',
        'compressionType': 'none',
        'objectCount': '94',
        'recordCount': '92171',
        'sizeKey': '74221058',
        'typeOfData': 'file'
      },
      'StoredAsSubDirectories': False
    },
    'PartitionKeys': [
      {'Name': 'partition_0', 'Type': 'string'},
      {'Name': 'partition_1', 'Type': 'string'},
      {'Name': 'partition_2', 'Type': 'string'}
    ],
    'TableType': 'EXTERNAL_TABLE',
    'Parameters': {
      'CrawlerSchemaDeserializerVersion': '1.0',
      'CrawlerSchemaSerializerVersion': '1.0',
      'UPDATED_BY_CRAWLER': 'asdfasdf',
      'averageRecordSize': '799',
      'classification': 'json',
      'compressionType': 'none',
      'objectCount': '94',
      'recordCount': '92171',
      'sizeKey': '74221058',
      'typeOfData': 'file'
    },
    'CreatedBy': 'arn:aws:sts::asdfasdf'
  },
  'ResponseMetadata': {
    'RequestId': 'asdfasdf',
    'HTTPStatusCode': 200,
    'HTTPHeaders': {
      'date': 'Thu, 01 Aug 2019 16:23:06 GMT',
      'content-type': 'application/x-amz-json-1.1',
      'content-length': '3471',
      'connection': 'keep-alive',
      'x-amzn-requestid': 'asdfasdf'
    },
    'RetryAttempts': 0
  }
}
Goal would be a Python dictionary with the value for each field type, vs. the embedded string. E.g.
expand_function('struct<loc_lat:double,service_handler:string,ip_address:string,device:bigint,source:struct<id:string,contacts:struct<admin:struct<email:string,name:string>>,name:string>,loc_name:string>')
returns
{
  'loc_lat': 'double',
  'service_handler': 'string',
  'ip_address': 'string',
  'device': 'bigint',
  'source': {
    'id': 'string',
    'contacts': {
      'admin': {
        'email': 'string',
        'name': 'string'
      }
    },
    'name': 'string'
  },
  'loc_name': 'string'
}
Thanks!
The accepted answer doesn't handle arrays.
This solution does:
import json
import re

def _hive_struct_to_json(hive_str):
    """
    Expands embedded Hive struct strings to Python dictionaries
    Args:
        Hive struct format as string
    Returns:
        JSON object
    """
    r = re.compile(r'(.*?)(struct<|array<|[:,>])(.*)')
    root = dict()
    to_parse = hive_str
    parents = []
    curr_elem = root
    key = None
    while to_parse:
        left, operator, to_parse = r.match(to_parse).groups()
        if operator == 'struct<' or operator == 'array<':
            parents.append(curr_elem)
            new_elem = dict() if operator == 'struct<' else list()
            if key:
                curr_elem[key] = new_elem
                curr_elem = new_elem
            elif isinstance(curr_elem, list):
                curr_elem.append(new_elem)
                curr_elem = new_elem
            key = None
        elif operator == ':':
            key = left
        elif operator == ',' or operator == '>':
            if left:
                if isinstance(curr_elem, dict):
                    curr_elem[key] = left
                elif isinstance(curr_elem, list):
                    curr_elem.append(left)
            if operator == '>':
                curr_elem = parents.pop()
    return root
hive_str = '''
struct<
    loc_lat:double,
    service_handler:string,
    ip_address:string,
    device:bigint,
    source:struct<
        id:string,
        contacts:struct<
            admin:struct<
                email:string,
                name:array<string>
            >
        >,
        name:string
    >,
    loc_name:string,
    tags:array<
        struct<
            key:string,
            value:string
        >
    >
>
'''
hive_str = re.sub(r'[\s]+', '', hive_str).strip()
print(hive_str)
print(json.dumps(_hive_struct_to_json(hive_str), indent=2))
Prints:
struct<loc_lat:double,service_handler:string,ip_address:string,device:bigint,source:struct<id:string,contacts:struct<admin:struct<email:string,name:array<string>>>,name:string>,loc_name:string,tags:array<struct<key:string,value:string>>>
{
  "loc_lat": "double",
  "service_handler": "string",
  "ip_address": "string",
  "device": "bigint",
  "source": {
    "id": "string",
    "contacts": {
      "admin": {
        "email": "string",
        "name": [
          "string"
        ]
      }
    },
    "name": "string"
  },
  "loc_name": "string",
  "tags": [
    {
      "key": "string",
      "value": "string"
    }
  ]
}
Here's a simpler, replace-based function running on the embedded Hive struct string above (this is the approach that doesn't handle arrays):
def _hive_struct_to_json(hive_struct):
    """
    Expands embedded Hive struct strings to Python dictionaries
    Args:
        Hive struct format as string
    Returns:
        JSON object
    """
    # Convert the embedded Hive type definition string to JSON
    hive_struct = hive_struct.replace(':', '":"')
    hive_struct = hive_struct.replace(',', '","')
    hive_struct = hive_struct.replace('struct<', '{"')
    hive_struct = hive_struct.replace('"{"', '{"')
    hive_struct = hive_struct.replace('>', '"}')
    hive_struct = hive_struct.replace('}"', '}')
    return json.loads(hive_struct)
hive_str = 'struct<loc_lat:double,service_handler:string,ip_address:string,device:bigint,source:struct<id:string,contacts:struct<admin:struct<email:string,name:string>>,name:string>,loc_name:string>'
print(json.dumps(_hive_struct_to_json(hive_str),indent=2))
Returns:
{
  "loc_lat": "double",
  "service_handler": "string",
  "ip_address": "string",
  "device": "bigint",
  "source": {
    "id": "string",
    "contacts": {
      "admin": {
        "email": "string",
        "name": "string"
      }
    },
    "name": "string"
  },
  "loc_name": "string"
}
I tried to scout around for existing approaches and found some helper functions in pyspark.
import pyspark.sql.types as T
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("tmp").getOrCreate()
struct_map = T._parse_datatype_string("MAP < STRING, STRUCT < year: INT, place: STRING, details: STRING >>")
struct_map is a pyspark type that in turn has nested fields to iterate over. Once you have an object like this, a recursive call to flatten it should be easy. I am open to hearing opinions from others about this approach.
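A minimal sketch of that recursive flattening, assuming the Hive struct string parses as DDL and that the private helper _parse_datatype_string keeps its current behaviour (it is not a public API):
import pyspark.sql.types as T

def type_to_plain(dtype):
    # Recursively turn a pyspark DataType into plain dicts/lists/strings
    if isinstance(dtype, T.StructType):
        return {f.name: type_to_plain(f.dataType) for f in dtype.fields}
    if isinstance(dtype, T.ArrayType):
        return [type_to_plain(dtype.elementType)]
    if isinstance(dtype, T.MapType):
        return {'key': type_to_plain(dtype.keyType),
                'value': type_to_plain(dtype.valueType)}
    return dtype.simpleString()  # e.g. 'double', 'bigint', 'string'

parsed = T._parse_datatype_string(var_type)  # var_type from the question
print(type_to_plain(parsed))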

Printing parsed JSON in Python

Assuming this is the .JSON file I have to parse:
{
  "item": {
    "allInventory": {
      "onHand": 64,
      "total": {
        "1000": 0,
        "1001": 6,
        "1002": 5,
        "1003": 3,
        "1004": 12,
        "1005": 0
      }
    }
  },
  "image": {
    "tag": "/828402de-6cc8-493e-8abd-935a48a3d766_1.285a6f66ecf3ee434100921a3911ce6c.jpeg?odnHeight=450&odnWidth=450&odnBg=FFFFFF"
  }
}
How would I go about printing the total values like:
1000 - 0
1001 - 6
1002 - 5
1003 - 3
1004 - 12
1005 - 0
I have already parsed the values, but I'm unsure of how to actually print them. I've already spent a while on this and couldn't find a solution, so any help is appreciated. Here is my code thus far:
import requests
import json
src = requests.get('https://hastebin.com/raw/nenowimite').json()
stats = src['item']['allInventory']['total']
print(stats)
This can be done through a for loop as follows:
for key in stats.keys():
    print(key, '-', stats[key])
Using Python 3.6+ you can do (similarly to Ecir's answer)
for key, value in stats.items():
    print(f'{key} - {value}')
which is clearer about what is the key and the value, and uses f-string interpolation.
You are almost there:
for item in stats.items():
    print('%s - %d' % item)
What this does: stats is already a dict. Looking at the documentation, there is the items method, which returns "a copy of the dictionary's list of (key, value) pairs". Each pair is then formatted with '%s - %d' (the keys are strings, the values integers).
You can try:
>>> import json
>>> data = """{
  "item": {
    "allInventory": {
      "onHand": 64,
      "total": {
        "1000": 0,
        "1001": 6,
        "1002": 5,
        "1003": 3,
        "1004": 12,
        "1005": 0
      }
    }
  },
  "image": {
    "tag": "/828402de-6cc8-493e-8abd-935a48a3d766_1.285a6f66ecf3ee434100921a3911ce6c.jpeg?odnHeight=450&odnWidth=450&odnBg=FFFFFF"
  }
}"""
>>> data = json.loads(data)
>>> print(data["item"]["allInventory"]["total"])
{'1000': 0, '1001': 6, '1002': 5, '1003': 3, '1004': 12, '1005': 0}
