Parse all valid datetime strings in json recursively - python

I have a JSON blob of the following format. Is there a way to identify all strings which match the format
%Y-%m-%dT%H:%M:%S
and convert them to datetime objects?
{
  "data": [
    {
      "name": "Testing",
      "dob": "2001-01-01T01:00:30"
    },
    {
      "name": "Testing2",
      "dob": "2001-01-01T01:00:30",
      "licence_info": {
        "issue_date": "2020-01-01T01:00:30"
      }
    }
  ]
}

The easiest way to do this is to parse each value and attempt to convert it to a datetime. You could do something like this:
import json
from datetime import datetime

def convert_dates(value):
    if isinstance(value, dict):
        return {k: convert_dates(v) for k, v in value.items()}
    elif isinstance(value, list):
        return [convert_dates(v) for v in value]
    else:
        try:
            # strptime raises ValueError on non-matching strings and
            # TypeError on non-string values (numbers, booleans, None)
            return datetime.strptime(value, '%Y-%m-%dT%H:%M:%S')
        except (ValueError, TypeError):
            return value
jstr = '''
{
  "data": [
    {
      "name": "Testing",
      "dob": "2001-01-01T01:00:30"
    },
    {
      "name": "Testing2",
      "dob": "2001-01-01T01:00:30",
      "licence_info": {
        "issue_date": "2020-01-01T01:00:30"
      }
    }
  ]
}
'''
d = json.loads(jstr)
convert_dates(d)
Output:
{
  'data': [
    {
      'name': 'Testing',
      'dob': datetime.datetime(2001, 1, 1, 1, 0, 30)
    },
    {
      'name': 'Testing2',
      'dob': datetime.datetime(2001, 1, 1, 1, 0, 30),
      'licence_info': {'issue_date': datetime.datetime(2020, 1, 1, 1, 0, 30)}
    }
  ]
}
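An alternative sketch, not part of the answer above: json.loads accepts an object_hook that is called for every decoded JSON object, so the conversion can happen during parsing. Note the hook only sees dict values, so a datetime string sitting directly inside a list would be missed.
import json
from datetime import datetime

def date_hook(obj):
    # object_hook is called bottom-up for each decoded JSON object (dict)
    for k, v in obj.items():
        if isinstance(v, str):
            try:
                obj[k] = datetime.strptime(v, '%Y-%m-%dT%H:%M:%S')
            except ValueError:
                pass
    return obj

d = json.loads(jstr, object_hook=date_hook)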

Related

I want to change value in json array object based on index number using pyspark

I want to change a value in a JSON array object based on its index number using pyspark, and will then use columnName to update the dataframe column names:
input:
jsonArray = [
  {
    "index": 1,
    "columnName": "Names"
  },
  {
    "index": 2,
    "columnName": "City"
  }
]
output:
jsonArray = [
  {
    "index": 1,
    "columnName": "titles"
  },
  {
    "index": 2,
    "columnName": "countries"
  }
]
function header:
def renameColumn(index, newName, df):
    return df_with_new_column_names
If I understood your requirement correctly, try something as below-
for i in range(len(jsonArray)):
    if jsonArray[i]['index'] == 1:
        jsonArray[i]['columnName'] = "titles"
    else:
        jsonArray[i]['columnName'] = "countries"
print(jsonArray)
Output -
[{'index': 1, 'columnName': 'titles'}, {'index': 2, 'columnName': 'countries'}]
Or, as a reusable function:
def jsonColumnName(jsonArray, indx, newName):
    for jsonObj in jsonArray:
        if jsonObj['index'] == indx:
            jsonObj['columnName'] = newName
    return jsonArray
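For the pyspark part of the question (the renameColumn header), here is a minimal sketch, assuming index refers to the 1-based position of the column in the DataFrame (the question doesn't pin this down):
def renameColumn(index, newName, df):
    # df.columns is the ordered list of column names; rename the one at the
    # assumed 1-based position and return the new DataFrame
    old_name = df.columns[index - 1]
    return df.withColumnRenamed(old_name, newName)
The two pieces can then be combined, e.g. df = renameColumn(1, 'titles', df) after updating jsonArray.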

Turn dict with duplicate keys into list containing these keys

I receive a response I have no control over from an API. Using requests, response.json() will filter out duplicate keys, so I would need to turn this response into a list where each key is an element in that list. What I get now:
{
  "user": {
    //...
  },
  "user": {
    //...
  },
  //...
}
What I need:
{
  "users": [
    {
      "user": {
        //...
      }
    },
    {
      "user": {
        //...
      }
    },
    //...
  ]
}
This way JSON won't filter out any of the results, and I can loop through users.
Okay, let me have a try with the method used in Python json parser allow duplicate keys.
All we need to do is handle the pairs_list ourselves.
from json import JSONDecoder

def parse_object_pairs(pairs):
    return pairs

data = """
{"foo": {"key": 2, "key": 3}, "foo": 4, "foo": 23}
"""
decoder = JSONDecoder(object_pairs_hook=parse_object_pairs)
pairs_list = decoder.decode(data)
# the pairs_list is the real thing which we can use

aggre_key = 's'

def recursive_handle(pairs_list):
    dct = {}
    for k, v in pairs_list:
        if v and isinstance(v, list) and isinstance(v[0], tuple):
            v = recursive_handle(v)
        if k + aggre_key in dct:
            dct[k + aggre_key].append({k: v})
        elif k in dct:
            first_dict = {k: dct.pop(k)}
            dct[k + aggre_key] = [first_dict, {k: v}]
        else:
            dct[k] = v
    return dct

print(recursive_handle(pairs_list))
output:
{'foos': [{'foo': {'keys': [{'key': 2}, {'key': 3}]}}, {'foo': 4}, {'foo': 23}]}
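Since requests forwards keyword arguments from response.json() straight to json.loads, the same hook can be plugged into the API response directly; a sketch with a hypothetical endpoint:
import requests

resp = requests.get('https://api.example.com/users')  # hypothetical URL
pairs_list = resp.json(object_pairs_hook=parse_object_pairs)
print(recursive_handle(pairs_list))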

Python sort JSON based on value

I need to sort my JSON based on value in ascending/descending order in Python.
This is my JSON:
{
  "efg": 1,
  "mnp": 4,
  "xyz": 3
}
expected output is:
{
  "mnp": 4,
  "xyz": 3,
  "efg": 1
}
The above is just a sample JSON; the actual JSON is much bigger.
And how would I reverse sort it based on value?
{
  "efg": 1,
  "xyz": 3,
  "mnp": 4
}
Please help.
import json
from collections import OrderedDict
json_str = """
{
"efg": 1,
"mnp": 4,
"xyz": 3
}
"""
json_dict = json.loads(json_str)
dict_sorted = OrderedDict(sorted(json_dict.items(), key=lambda x: x[1]))
str_sorted = json.dumps(dict_sorted) # '{"efg": 1, "xyz": 3, "mnp": 4}'
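For the descending half of the question, sorted takes reverse=True; and on Python 3.7+ plain dicts preserve insertion order, so OrderedDict is optional there:
desc_sorted = dict(sorted(json_dict.items(), key=lambda x: x[1], reverse=True))
print(json.dumps(desc_sorted))  # {"mnp": 4, "xyz": 3, "efg": 1}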

AWS Glue: How to expand nested Hive struct to Dict?

I'm trying to expand the field mappings in a Table mapped by my AWS Glue crawler into a nested dictionary in Python. But I can't find any Spark/Hive parsers to deserialize the
var_type = 'struct<loc_lat:double,service_handler:string,ip_address:string,device:bigint,source:struct<id:string,contacts:struct<admin:struct<email:string,name:string>>,name:string>,loc_name:string>'
string located in table_schema['Table']['StorageDescriptor']['Columns'] to a Python dict.
How to dump the table definition in Glue:
import boto3
client = boto3.client('glue')
client.get_table(DatabaseName=selected_db, Name=selected_table)
Response:
table_schema = {
  'Table': {
    'Name': 'asdfasdf',
    'DatabaseName': 'asdfasdf',
    'Owner': 'owner',
    'CreateTime': datetime.datetime(2019, 7, 29, 13, 20, 13, tzinfo=tzlocal()),
    'UpdateTime': datetime.datetime(2019, 7, 29, 13, 20, 13, tzinfo=tzlocal()),
    'LastAccessTime': datetime.datetime(2019, 7, 29, 13, 20, 13, tzinfo=tzlocal()),
    'Retention': 0,
    'StorageDescriptor': {
      'Columns': [
        {'Name': 'version', 'Type': 'int'},
        {'Name': 'payload',
         'Type': 'struct<loc_lat:double,service_handler:string,ip_address:string,device:bigint,source:struct<id:string,contacts:struct<admin:struct<email:string,name:string>>,name:string>,loc_name:string>'},
        {'Name': 'origin', 'Type': 'string'}
      ],
      'Location': 's3://asdfasdf/',
      'InputFormat': 'org.apache.hadoop.mapred.TextInputFormat',
      'OutputFormat': 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat',
      'Compressed': False,
      'NumberOfBuckets': -1,
      'SerdeInfo': {
        'SerializationLibrary': 'org.openx.data.jsonserde.JsonSerDe',
        'Parameters': {'paths': 'origin,payload,version'}
      },
      'BucketColumns': [],
      'SortColumns': [],
      'Parameters': {
        'CrawlerSchemaDeserializerVersion': '1.0',
        'CrawlerSchemaSerializerVersion': '1.0',
        'UPDATED_BY_CRAWLER': 'asdfasdf',
        'averageRecordSize': '799',
        'classification': 'json',
        'compressionType': 'none',
        'objectCount': '94',
        'recordCount': '92171',
        'sizeKey': '74221058',
        'typeOfData': 'file'
      },
      'StoredAsSubDirectories': False
    },
    'PartitionKeys': [
      {'Name': 'partition_0', 'Type': 'string'},
      {'Name': 'partition_1', 'Type': 'string'},
      {'Name': 'partition_2', 'Type': 'string'}
    ],
    'TableType': 'EXTERNAL_TABLE',
    'Parameters': {
      'CrawlerSchemaDeserializerVersion': '1.0',
      'CrawlerSchemaSerializerVersion': '1.0',
      'UPDATED_BY_CRAWLER': 'asdfasdf',
      'averageRecordSize': '799',
      'classification': 'json',
      'compressionType': 'none',
      'objectCount': '94',
      'recordCount': '92171',
      'sizeKey': '74221058',
      'typeOfData': 'file'
    },
    'CreatedBy': 'arn:aws:sts::asdfasdf'
  },
  'ResponseMetadata': {
    'RequestId': 'asdfasdf',
    'HTTPStatusCode': 200,
    'HTTPHeaders': {
      'date': 'Thu, 01 Aug 2019 16:23:06 GMT',
      'content-type': 'application/x-amz-json-1.1',
      'content-length': '3471',
      'connection': 'keep-alive',
      'x-amzn-requestid': 'asdfasdf'
    },
    'RetryAttempts': 0
  }
}
Goal would be a Python dictionary with the value for each field type, vs. the embedded string. E.g.
expand_function('struct<loc_lat:double,service_handler:string,ip_address:string,device:bigint,source:struct<id:string,contacts:struct<admin:struct<email:string,name:string>>,name:string>,loc_name:string>')
returns
{
  'loc_lat': 'double',
  'service_handler': 'string',
  'ip_address': 'string',
  'device': 'bigint',
  'source': {
    'id': 'string',
    'contacts': {
      'admin': {
        'email': 'string',
        'name': 'string'
      }
    },
    'name': 'string'
  },
  'loc_name': 'string'
}
Thanks!
The accepted answer doesn't handle arrays.
This solution does:
import json
import re

def _hive_struct_to_json(hive_str):
    """
    Expands embedded Hive struct strings to Python dictionaries
    Args:
        Hive struct format as string
    Returns:
        JSON object
    """
    r = re.compile(r'(.*?)(struct<|array<|[:,>])(.*)')
    root = dict()
    to_parse = hive_str
    parents = []
    curr_elem = root
    key = None
    while to_parse:
        left, operator, to_parse = r.match(to_parse).groups()
        if operator == 'struct<' or operator == 'array<':
            parents.append(curr_elem)
            new_elem = dict() if operator == 'struct<' else list()
            if key:
                curr_elem[key] = new_elem
                curr_elem = new_elem
            elif isinstance(curr_elem, list):
                curr_elem.append(new_elem)
                curr_elem = new_elem
            key = None
        elif operator == ':':
            key = left
        elif operator == ',' or operator == '>':
            if left:
                if isinstance(curr_elem, dict):
                    curr_elem[key] = left
                elif isinstance(curr_elem, list):
                    curr_elem.append(left)
            if operator == '>':
                curr_elem = parents.pop()
    return root
hive_str = '''
struct<
    loc_lat:double,
    service_handler:string,
    ip_address:string,
    device:bigint,
    source:struct<
        id:string,
        contacts:struct<
            admin:struct<
                email:string,
                name:array<string>
            >
        >,
        name:string
    >,
    loc_name:string,
    tags:array<
        struct<
            key:string,
            value:string
        >
    >
>
'''
hive_str = re.sub(r'[\s]+', '', hive_str).strip()
print(hive_str)
print(json.dumps(_hive_struct_to_json(hive_str), indent=2))
Prints:
struct<loc_lat:double,service_handler:string,ip_address:string,device:bigint,source:struct<id:string,contacts:struct<admin:struct<email:string,name:array<string>>>,name:string>,loc_name:string,tags:array<struct<key:string,value:string>>>
{
  "loc_lat": "double",
  "service_handler": "string",
  "ip_address": "string",
  "device": "bigint",
  "source": {
    "id": "string",
    "contacts": {
      "admin": {
        "email": "string",
        "name": [
          "string"
        ]
      }
    },
    "name": "string"
  },
  "loc_name": "string",
  "tags": [
    {
      "key": "string",
      "value": "string"
    }
  ]
}
Here's a simpler, replace-based function running on the embedded Hive struct string above (this is the approach that doesn't handle arrays):
def _hive_struct_to_json(hive_struct):
    """
    Expands embedded Hive struct strings to Python dictionaries
    Args:
        Hive struct format as string
    Returns:
        JSON object
    """
    # Convert the embedded Hive type definition string to JSON
    hive_struct = hive_struct.replace(':', '":"')
    hive_struct = hive_struct.replace(',', '","')
    hive_struct = hive_struct.replace('struct<', '{"')
    hive_struct = hive_struct.replace('"{"', '{"')
    hive_struct = hive_struct.replace('>', '"}')
    hive_struct = hive_struct.replace('}"', '}')
    return json.loads(hive_struct)
hive_str = 'struct<loc_lat:double,service_handler:string,ip_address:string,device:bigint,source:struct<id:string,contacts:struct<admin:struct<email:string,name:string>>,name:string>,loc_name:string>'
print(json.dumps(_hive_struct_to_json(hive_str),indent=2))
Returns:
{
  "loc_lat": "double",
  "service_handler": "string",
  "ip_address": "string",
  "device": "bigint",
  "source": {
    "id": "string",
    "contacts": {
      "admin": {
        "email": "string",
        "name": "string"
      }
    },
    "name": "string"
  },
  "loc_name": "string"
}
I tried to scout around for existing approaches and found some helper functions in pyspark.
import pyspark.sql.types as T
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("tmp").getOrCreate()
struct_map = T._parse_datatype_string("MAP < STRING, STRUCT < year: INT, place: STRING, details: STRING >>")
struct_map is a pyspark type that in turn has nested fields to iterate over. Once you have an object like this, a recursive call to flatten it should be easy. I am open to hearing opinions from others about this approach.
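A minimal sketch of that recursive flattening, assuming the Hive struct string parses as DDL and that the private helper _parse_datatype_string keeps its current behaviour (it is not a public API):
import pyspark.sql.types as T

def type_to_plain(dtype):
    # Recursively turn a pyspark DataType into plain dicts/lists/strings
    if isinstance(dtype, T.StructType):
        return {f.name: type_to_plain(f.dataType) for f in dtype.fields}
    if isinstance(dtype, T.ArrayType):
        return [type_to_plain(dtype.elementType)]
    if isinstance(dtype, T.MapType):
        return {'key': type_to_plain(dtype.keyType),
                'value': type_to_plain(dtype.valueType)}
    return dtype.simpleString()  # e.g. 'double', 'bigint', 'string'

parsed = T._parse_datatype_string(var_type)  # var_type from the question
print(type_to_plain(parsed))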

Printing parsed JSON in Python

Assuming this is the .JSON file I have to parse:
{
  "item": {
    "allInventory": {
      "onHand": 64,
      "total": {
        "1000": 0,
        "1001": 6,
        "1002": 5,
        "1003": 3,
        "1004": 12,
        "1005": 0
      }
    }
  },
  "image": {
    "tag": "/828402de-6cc8-493e-8abd-935a48a3d766_1.285a6f66ecf3ee434100921a3911ce6c.jpeg?odnHeight=450&odnWidth=450&odnBg=FFFFFF"
  }
}
How would I go about printing the total values like:
1000 - 0
1001 - 6
1002 - 5
1003 - 3
1004 - 12
1005 - 0
I have already parsed the values, but I'm unsure of how to actually print them. I've already spent a while on this and couldn't find a solution, so any help is appreciated. Here is my code thus far:
import requests
import json
src = requests.get('https://hastebin.com/raw/nenowimite').json()
stats = src['item']['allInventory']['total']
print(stats)
This can be done through a for loop as follows:
for key in stats.keys():
    print(key, '-', stats[key])
Using Python 3.6+ you can do (similarly to Ecir's answer)
for key, value in stats.items():
    print(f'{key} - {value}')
which is clearer about what is the key and the value, and uses f-string interpolation.
You are almost there:
for item in stats.items():
    print('%s - %d' % item)
What this does: stats is already a dict. Looking at the documentation, there is the items method, which returns "a copy of the dictionary's list of (key, value) pairs". Each pair is then formatted with '%s - %d' (the keys are strings, the values integers).
You can try:
>>> import json
>>> data = """{
  "item": {
    "allInventory": {
      "onHand": 64,
      "total": {
        "1000": 0,
        "1001": 6,
        "1002": 5,
        "1003": 3,
        "1004": 12,
        "1005": 0
      }
    }
  },
  "image": {
    "tag": "/828402de-6cc8-493e-8abd-935a48a3d766_1.285a6f66ecf3ee434100921a3911ce6c.jpeg?odnHeight=450&odnWidth=450&odnBg=FFFFFF"
  }
}"""
>>> data = json.loads(data)
>>> print(data["item"]["allInventory"]["total"])
{'1000': 0, '1001': 6, '1002': 5, '1003': 3, '1004': 12, '1005': 0}
