I am new to PySpark. My requirement is to extract the attribute names from a nested JSON file. I tried json_normalize from the pandas package; it works for top-level attributes but never fetches the attributes inside JSON array attributes. My JSON doesn't have a static structure; it varies for each document we receive. Could someone please walk me through the small example provided below?
{
  "id": "1",
  "name": "a",
  "salaries": [
    { "salary": "1000" },
    { "salary": "5000" }
  ],
  "states": {
    "state": "Karnataka",
    "cities": [
      { "city": "Bangalore" },
      { "city": "Mysore" }
    ],
    "state": "Tamil Nadu",
    "cities": [
      { "city": "Chennai" },
      { "city": "Coimbatore" }
    ]
  }
}
I am especially stuck on the JSON array elements.
Expected output:
id
name
salaries.salary
states.state
states.cities.city
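To illustrate the json_normalize behaviour described above, here is a minimal sketch (assuming a recent pandas where json_normalize is a top-level function; the duplicate keys in the sample collapse to their last values once parsed). Nested dicts are flattened into dotted columns, but lists stay as whole columns unless you point record_path at them:

import pandas as pd

doc = {"id": "1", "name": "a",
       "salaries": [{"salary": "1000"}, {"salary": "5000"}],
       "states": {"state": "Tamil Nadu",
                  "cities": [{"city": "Chennai"}, {"city": "Coimbatore"}]}}

# nested dicts are flattened, but the lists survive as single object columns
print(pd.json_normalize(doc).columns.tolist())
# ['id', 'name', 'salaries', 'states.state', 'states.cities']

# a list only gets flattened when you name it explicitly
print(pd.json_normalize(doc, record_path="salaries").columns.tolist())
# ['salary']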
Here is another solution for extracting all nested attributes from the JSON:
import json

result_set = set()

def parse_json_array(json_array, parent_path):
    # recurse into every dict element of a JSON array
    for element in json_array:
        if isinstance(element, dict):
            parse_json(element, parent_path)

def parse_json(json_obj, parent_path):
    for key in json_obj.keys():
        key_value = json_obj.get(key)
        child_path = str(key) if parent_path == "" else parent_path + "." + str(key)
        if isinstance(key_value, dict):
            parse_json(key_value, child_path)
        elif isinstance(key_value, list):
            parse_json_array(key_value, child_path)
        result_set.add((parent_path + "." + key).encode('ascii', 'ignore'))

file_name = "C:/input/sample.json"
file_data = open(file_name, "r")
json_data = json.load(file_data)
print json_data
parse_json(json_data, "")
print list(result_set)
Output:
{u'states': {u'state': u'Tamil Nadu', u'cities': [{u'city': u'Chennai'}, {u'city': u'Coimbatore'}]}, u'id': u'1', u'salaries': [{u'salary': u'1000'}, {u'salary': u'5000'}], u'name': u'a'}
['states.cities.city', 'states.cities', '.id', 'states.state', 'salaries.salary', '.salaries', '.states', '.name']
Note:
My Python version: 2.7
You can also do it this way.
data = { "id":"1", "name":"a", "salaries":[ { "salary":"1000" }, { "salary":"5000" } ], "states":{ "state":"Karnataka", "cities":[ { "city":"Bangalore" }, { "city":"Mysore" } ], "state":"Tamil Nadu", "cities":[ { "city":"Chennai" }, { "city":"Coimbatore" } ] } }
def dict_ittr(lin, data):
    for k, v in data.items():
        if type(v) is list:
            for l in v:
                dict_ittr(lin + "." + k, l)
        elif type(v) is dict:
            dict_ittr(lin + "." + k, v)
        else:
            print lin + "." + k

dict_ittr("", data)
Output:
.states.state
.states.cities.city
.states.cities.city
.id
.salaries.salary
.salaries.salary
.name
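The traversal above visits every list element separately, which is why repeated paths show up once per element. A small variation on the same idea (a sketch, not part of the original answer) that collects a de-duplicated set of paths instead of printing, using the data dict defined above:

def collect_paths(node, prefix=""):
    # same walk as dict_ittr, but gathers unique dotted paths instead of printing
    found = set()
    if isinstance(node, dict):
        for k, v in node.items():
            found |= collect_paths(v, prefix + "." + k)
    elif isinstance(node, list):
        for item in node:
            found |= collect_paths(item, prefix)
    else:
        found.add(prefix.lstrip("."))
    return found

print(sorted(collect_paths(data)))
# ['id', 'name', 'salaries.salary', 'states.cities.city', 'states.state']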
If you treat the JSON like a Python dictionary, this should work.
I just wrote a simple recursive program.
Script
import json

def js_r(filename):
    with open(filename) as f_in:
        return json.load(f_in)

g = js_r("city.json")

answer_d = {}

def base_line(g, answer_d):
    for key in g.keys():
        answer_d[key] = {}
    return answer_d

answer_d = base_line(g, answer_d)

def recurser_func(g, answer_d):
    for k in g.keys():
        if type(g[k]) == type([]):  # if the value is a list
            answer_d[k] = {list(g[k][0].keys())[0]: {}}
        if type(g[k]) == type({}):  # if the value is a dictionary
            answer_d[k] = {list(g[k].keys())[0]: {}}  # set key equal to its first child key
            answer_d[k] = recurser_func(g[k], answer_d[k])
    return answer_d

recurser_func(g, answer_d)

def printer_func(answer_d, list_to_print, parent):
    for k in answer_d.keys():
        if len(answer_d[k].keys()) == 1:
            list_to_print.append(parent)
            list_to_print[-1] += k
            list_to_print[-1] += "." + str(list(answer_d[k].keys())[0])
        if len(answer_d[k].keys()) == 0:
            list_to_print.append(parent)
            list_to_print[-1] += k
        if len(answer_d[k].keys()) > 1:
            printer_func(answer_d[k], list_to_print, k + ".")
    return list_to_print

l = printer_func(answer_d, [], "")
final = " ".join(l)
print(final)
Explanation
base_line makes a dictionary of all your base keys.
recurser_func checks whether each key's value is a list or dict and adds to the answer dictionary as necessary until answer_d looks like: {'id': {}, 'name': {}, 'salaries': {'salary': {}}, 'states': {'state': {}, 'cities': {'city': {}}}}
After these 2 functions are called you have, in a sense, a dictionary of keys. printer_func is then a recursive function that prints it in the format you wanted.
NOTE:
Your question is similar to this one: Get all keys of a nested dictionary. But since you have nested lists as well as dictionaries, the answers there won't work for you as-is; that question has more discussion on the topic if you want more background.
EDIT 1
My Python version is 3.7.1.
I have added a JSON file opener at the top. I assume the file is named city.json and sits in the same directory.
EDIT 2: More thorough explanation
The main difficulty I found with your data is that you can have arbitrarily nested lists and dictionaries, which makes things complicated. Since arbitrary nesting is possible, I knew this was a recursion problem.
So, I build a dictionary of dictionaries representing the key structure that you are looking for. First I start with the baseline.
base_line makes {'id': {}, 'name': {}, 'salaries': {}, 'states': {}}. This is a dictionary of empty dictionaries; I know that when you print, every key path (like states.state) starts with one of these words.
recursion
Then I add all the child keys using recurser_func.
Given a dictionary g, this function loops through all the keys in that dictionary and (assuming answer_d has every key that g has) adds each key's child to answer_d.
If the child is a dictionary, I recurse, with g now being the sub-part of the dictionary that pertains to the children, and answer_d being the sub-part of answer_d that pertains to the child.
Related
I would like to write a function that mimics a folder structure in Python. Given a list of strings I would like to create a tree of folders and subfolders. For example:
['beers', 'wines', 'beers/ipa/stone', 'wines/red/cabernet']
Would output a dictionary with the following:
{
'beers': {
'ipa': {
'stone': {}
}
},
'wines': {
'red': {
'cabernet': {}
}
}
}
x = ['beers', 'wines', 'beers/ipa/stone', 'wines/red/cabernet']

def add_items(d, items):
    if len(items) == 1:
        if items[0] in d:
            return
        else:
            d[items[0]] = dict()
    else:
        if items[0] not in d:
            d[items[0]] = dict()
        add_items(d[items[0]], items[1:])

out = dict()
for item in x:
    items = item.split("/")
    add_items(out, items)

print(out)
{'wines': {'red': {'cabernet': {}}}, 'beers': {'ipa': {'stone': {}}}}
Just go through your list and add the names from each string that you split on the slash character. You can use setdefault() to ensure that the next-level dictionary exists (i.e. entries are auto-created as you go):
strings = ['beers', 'wines', 'beers/ipa/stone', 'wines/red/cabernet']

directory = dict()
for path in strings:
    d = directory
    for name in path.split("/"):
        d = d.setdefault(name, dict())

print(directory)
{'beers':
{ 'ipa': {'stone': {}} },
'wines':
{'red': {'cabernet': {}} }
}
With Python 3.7+ (where dicts preserve insertion order), the order of items in each dict will correspond to their original relative order in the list of strings.
If you want the items in each dictionary to appear in alphanumerical order, you can change the loop like this:
directory = dict()
for path in sorted(s.split("/") for s in strings):
    d = directory
    for name in path:
        d = d.setdefault(name, dict())
If you like recursive functions, here's a simple one that does the same thing (but less efficiently):
def makeTree(strings, separator="/", tree=None):
    tree = tree or dict()
    for name, *subs in (s.split(separator, 1) for s in strings):
        tree[name] = makeTree(subs, separator, tree.get(name))
    return tree
d = makeTree(strings)
print(d)
{'beers':
{ 'ipa': {'stone': {}} },
'wines':
{'red': {'cabernet': {}} }
}
A Python Lambda function that gets invoked for a DynamoDB stream receives JSON in DynamoDB format (i.e. with the data types embedded in the JSON). I would like to convert DynamoDB JSON to standard JSON. PHP and Node.js have a Marshaler that can do this. Please let me know if there are similar or other options for Python.
DynamoDB_format = `{"feas":
{"M": {
"fea": {
"L": [
{
"M": {
"pre": {
"N": "1"
},
"Li": {
"N": "1"
},
"Fa": {
"N": "0"
},
"Mo": {
"N": "1"
},
"Ti": {
"S": "20160618184156529"
},
"Fr": {
"N": "4088682"
}
}
}
]
}
}
}
}
Update: There is a library now: https://pypi.org/project/dynamodb-json/
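A rough usage sketch based on that project's README (double-check the README for the exact API before relying on it):

from dynamodb_json import json_util as dynamodb_json

# plain Python dict -> DynamoDB-typed JSON string
typed = dynamodb_json.dumps({"id": 1, "name": "a"})

# DynamoDB-typed JSON -> plain Python dict
plain = dynamodb_json.loads(typed)
print(plain)  # back to an ordinary dict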
Here is an improved version of indiangolfer's answer. While that solution works for the question as asked, this version may be more useful for others who stumble upon this thread.
def unmarshal_dynamodb_json(node):
    data = dict({})
    data['M'] = node
    return _unmarshal_value(data)

def _unmarshal_value(node):
    if type(node) is not dict:
        return node

    for key, value in node.items():
        # S    - String     - return string
        # N    - Number     - return int or float (if it includes '.')
        # B    - Binary     - not handled
        # BOOL - Boolean    - return bool
        # NULL - Null       - return None
        # M    - Map        - return a dict
        # L    - List       - return a list
        # SS   - String Set - not handled
        # NN   - Number Set - not handled
        # BB   - Binary Set - not handled
        key = key.lower()
        if key == 'bool':
            return value
        if key == 'null':
            return None
        if key == 's':
            return value
        if key == 'n':
            if '.' in str(value):
                return float(value)
            return int(value)
        if key in ['m', 'l']:
            if key == 'm':
                data = {}
                for key1, value1 in value.items():
                    if key1.lower() == 'l':
                        data = [_unmarshal_value(n) for n in value1]
                    else:
                        if type(value1) is not dict:
                            return _unmarshal_value(value)
                        data[key1] = _unmarshal_value(value1)
                return data
            data = []
            for item in value:
                data.append(_unmarshal_value(item))
            return data
It is improved in the following ways:
handles more data types, including lists, which were not handled correctly previously
handles lowercase and uppercase keys
Edit: fix recursive object bug
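A quick sanity check against an abbreviated piece of the question's payload (my own sketch, not part of the original answer):

sample = {"pre": {"N": "1"}, "Ti": {"S": "20160618184156529"}, "Fr": {"N": "4088682"}}
print(unmarshal_dynamodb_json(sample))
# {'pre': 1, 'Ti': '20160618184156529', 'Fr': 4088682}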
I couldn't find anything out in the wild, so I decided to port the PHP implementation of DynamoDB JSON to standard JSON that was published here. I tested this in a Python Lambda function processing a DynamoDB stream. If there is a better way to do this, please let me know.
(PS: this is not a complete port of the PHP Marshaler.)
The JSON in the question gets transformed to:
{
"feas":{
"fea":[
{
"pre":"1",
"Mo":"1",
"Ti":"20160618184156529",
"Fa":"0",
"Li":"1",
"Fr":"4088682"
}
]
}
}
def unmarshalJson(node):
    data = {}
    data["M"] = node
    return unmarshalValue(data, True)

def unmarshalValue(node, mapAsObject):
    for key, value in node.items():
        if key == "S" or key == "N":
            return value
        if key == "M" or key == "L":
            if key == "M":
                if mapAsObject:
                    data = {}
                    for key1, value1 in value.items():
                        data[key1] = unmarshalValue(value1, mapAsObject)
                    return data
            data = []
            for item in value:
                data.append(unmarshalValue(item, mapAsObject))
            return data
To easily convert to and from the DynamoDB JSON I recommend using the boto3 dynamodb types serializer and deserializer.
import boto3
from boto3.dynamodb.types import TypeSerializer, TypeDeserializer

ts = TypeSerializer()
td = TypeDeserializer()

data = {"id": "5000"}

serialized_data = ts.serialize(data)
print(serialized_data)
# {'M': {'id': {'S': '5000'}}}

deserialized_data = td.deserialize(serialized_data)
print(deserialized_data)
# {'id': '5000'}
For more details check out the boto3.dynamodb.types classes.
As taken from this blog, the following seems to be the simplest solution:
from boto3.dynamodb.types import TypeDeserializer, TypeSerializer

def unmarshall(dynamo_obj: dict) -> dict:
    """Convert a DynamoDB dict into a standard dict."""
    deserializer = TypeDeserializer()
    return {k: deserializer.deserialize(v) for k, v in dynamo_obj.items()}

def marshall(python_obj: dict) -> dict:
    """Convert a standard dict into a DynamoDB dict."""
    serializer = TypeSerializer()
    return {k: serializer.serialize(v) for k, v in python_obj.items()}
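A quick round-trip sketch with these helpers; note that TypeSerializer expects Decimal rather than float for numeric values:

from decimal import Decimal

item = {"id": "1", "price": Decimal("9.99"), "tags": ["a", "b"]}
dynamo_item = marshall(item)
# {'id': {'S': '1'}, 'price': {'N': '9.99'}, 'tags': {'L': [{'S': 'a'}, {'S': 'b'}]}}
assert unmarshall(dynamo_item) == item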
import json
import boto3
import base64

def lambda_handler(event, context):
    print(event)
    # keep the output list local to the invocation so warm Lambda containers
    # don't accumulate records across calls
    output = []
    for record in event['records']:
        payload = base64.b64decode(record['data']).decode('utf-8')
        print('payload:', payload)

        row_w_newline = payload + "\n"
        print('row_w_newline type:', type(row_w_newline))
        row_w_newline = base64.b64encode(row_w_newline.encode('utf-8'))

        output_record = {
            'recordId': record['recordId'],
            'result': 'Ok',
            'data': row_w_newline
        }
        output.append(output_record)

    print('Processed {} records.'.format(len(event['records'])))
    return {'records': output}
Given data organized in JSON format (code example below), how can we get the path of keys and sub-keys associated with a given value?
i.e.
Given the input "23314" we need to return a list with:
Fanerozoico, Cenozoico, Quaternario, Pleistocenico, Superior.
Since the data is a JSON file, we decoded it using Python and the json lib:
import json

def decode_crono(crono_file):
    with open(crono_file) as json_file:
        data = json.load(json_file)
From here on we do not know how to process it in a way that gets what we need.
We can access keys like this:
k = data["Fanerozoico"]["Cenozoico"]["Quaternario "]["Pleistocenico "].keys()
or values like this:
v= data["Fanerozoico"]["Cenozoico"]["Quaternario "]["Pleistocenico "]["Superior"].values()
but this is still far from what we need.
{
"Fanerozoico": {
"id": "20000",
"Cenozoico": {
"id": "23000",
"Quaternario": {
"id": "23300",
"Pleistocenico": {
"id": "23310",
"Superior": {
"id": "23314"
},
"Medio": {
"id": "23313"
},
"Calabriano": {
"id": "23312"
},
"Gelasiano": {
"id": "23311"
}
}
}
}
}
}
It's a little hard to understand exactly what you are after here, but it seems like you have a bunch of nested JSON and you want to search it for an id and return a list that represents the path down the JSON nesting. If so, the quick and easy approach is to recurse on the dictionary (that you got from json.load) and collect the keys as you go. When you find an 'id' key that matches the id you are searching for, you are done. Here is some code that does that:
def all_keys(search_dict, key_id):
    def _all_keys(search_dict, key_id, keys=None):
        if not keys:
            keys = []
        for i in search_dict:
            if search_dict[i] == key_id:
                return keys + [i]
            if isinstance(search_dict[i], dict):
                potential_keys = _all_keys(search_dict[i], key_id, keys + [i])
                if 'id' in potential_keys:
                    keys = potential_keys
                    break
        return keys
    return _all_keys(search_dict, key_id)[:-1]
The reason for the nested function is to strip off the 'id' key that would otherwise be on the end of the list.
This is really just to give you an idea of what a solution might look like. Beware the python recursion limit!
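A usage sketch, assuming data holds the parsed dict from the question:

print(all_keys(data, "23314"))
# ['Fanerozoico', 'Cenozoico', 'Quaternario', 'Pleistocenico', 'Superior']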
Based on the assumption that you need the full dictionary path up to a key named id with a particular value, here's a recursive solution that iterates the whole dict. Bear in mind that:
The code is not optimized at all.
For huge JSON objects it might yield a StackOverflow :)
It will stop at the first value found (in theory there shouldn't be more than one if the JSON is semantically correct).
The code:
import json
from types import DictType

SEARCH_KEY_NAME = "id"
FOUND_FLAG = ()
CRONO_FILE = "a.jsn"

def decode_crono(crono_file):
    with open(crono_file) as json_file:
        return json.load(json_file)

def traverse_dict(dict_obj, value):
    for key in dict_obj:
        key_obj = dict_obj[key]
        if key == SEARCH_KEY_NAME and key_obj == value:
            return FOUND_FLAG
        elif isinstance(key_obj, DictType):
            inner = traverse_dict(key_obj, value)
            if inner is not None:
                return (key,) + inner
    return None

if __name__ == "__main__":
    value = "23314"
    json_dict = decode_crono(CRONO_FILE)
    result = traverse_dict(json_dict, value)
    print result
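With the question's JSON this prints ('Fanerozoico', 'Cenozoico', 'Quaternario', 'Pleistocenico', 'Superior').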
json = '{
    "app": {
        "Garden": {
            "Flowers": {
                "Red flower": "Rose",
                "White Flower": "Jasmine",
                "Yellow Flower": "Marigold"
            }
        },
        "Fruits": {
            "Yellow fruit": "Mango",
            "Green fruit": "Guava",
            "White Flower": "groovy"
        },
        "Trees": {
            "label": {
                "Yellow fruit": "Pumpkin",
                "White Flower": "Bogan"
            }
        }
    }
}'
Here is my JSON string, which keeps changing frequently, so the position of keys within the dictionary is not the same every time. I need to search for a key and print its corresponding value. Since the JSON string changes every time, I have written a recursive function (see below) to search for the key in the new JSON string and print the value. However, we now have the same key multiple times with different values, so how can I get the complete path of the key, to make it easier to tell which key's value it is? For example the result should look like this:
app.Garden.Flowers.white Flower = Jasmine
app.Fruits.White Flower = groovy
app.Trees.label.White Flower = Bogan
My code so far:
import json

with open('data.json') as data_file:
    j = json.load(data_file)
# j = json.loads(a)

def find(element, JSON):
    if element in JSON:
        print JSON[element].encode('utf-8')
    for key in JSON:
        if isinstance(JSON[key], dict):
            find(element, JSON[key])

find(element to search, j)
You could add a string parameter that keeps track of the current JSON path. Something like the following could work:
def find(element, JSON, path, all_paths):
    if element in JSON:
        # build the full path for the match without disturbing the prefix
        # used for the recursive calls below
        found = path + element + ' = ' + JSON[element].encode('utf-8')
        print found
        all_paths.append(found)
    for key in JSON:
        if isinstance(JSON[key], dict):
            find(element, JSON[key], path + key + '.', all_paths)
You would call it like this:
all_paths = []
find(element_to_search, j, '', all_paths)
def getDictValueFromPath(listKeys, jsonData):
    """
    >>> mydict = {
    ...     'a': {
    ...         'b': {
    ...             'c': '1'
    ...         }
    ...     }
    ... }
    >>> mykeys = ['a', 'b']
    >>> getDictValueFromPath(mykeys, mydict)
    {'c': '1'}
    """
    localData = jsonData.copy()
    for k in listKeys:
        try:
            localData = localData[k]
        except:
            return None
    return localData
The following code snippet will give a list of paths that are accessible in the JSON. It's following the convention of JSON paths where [] signifies that the path is a list.
import collections.abc

def get_paths(source):
    paths = []
    if isinstance(source, collections.abc.MutableMapping):
        for k, v in source.items():
            if k not in paths:
                paths.append(k)
            for x in get_paths(v):
                if k + '.' + x not in paths:
                    paths.append(k + '.' + x)
    elif isinstance(source, collections.abc.Sequence) and not isinstance(source, str):
        for x in source:
            for y in get_paths(x):
                if '[].' + y not in paths:
                    paths.append('[].' + y)
    return paths
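Applied to an abbreviated version of the question's data, it yields both intermediate and leaf paths (a quick sketch, not part of the original answer):

doc = {"app": {"Garden": {"Flowers": {"Red flower": "Rose"}},
               "Fruits": {"Yellow fruit": "Mango"}}}
print(get_paths(doc))
# ['app', 'app.Garden', 'app.Garden.Flowers', 'app.Garden.Flowers.Red flower',
#  'app.Fruits', 'app.Fruits.Yellow fruit']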
Here is a modified version of Brian's answer that supports lists and returns the result:
def find(element, JSON, path='', all_paths=None):
    all_paths = [] if all_paths is None else all_paths
    if isinstance(JSON, dict):
        for key, value in JSON.items():
            find(element, value, '{}["{}"]'.format(path, key), all_paths)
    elif isinstance(JSON, list):
        for index, value in enumerate(JSON):
            find(element, value, '{}[{}]'.format(path, index), all_paths)
    else:
        if JSON == element:
            all_paths.append(path)
    return all_paths
Usage:
find(element, JSON)
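Note that this version matches on values rather than keys; with the question's data, searching for a value returns the bracketed path leading to it:

print(find('Jasmine', j))
# ['["app"]["Garden"]["Flowers"]["White Flower"]']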
I am trying to convert a list of dot-separated strings, e.g.
['one.two.three.four', 'one.six.seven.eight', 'five.nine.ten', 'twelve.zero']
into a tree (nested lists or dicts - anything that is easy to walk through).
The real data happens to have 1 to 4 dot-separated parts of different lengths, and has 2200 records in total.
My actual goal is to fill a set of 4 QComboBoxes with this data, in such a way that the 1st QComboBox is filled with the first-level items ['one', 'five', 'twelve'] (no duplicates). Then, depending on the chosen item, the 2nd QComboBox is filled with its related items: for 'one' that would be ['two', 'six'], and so on, if there's another nested level.
So far I've got a working list -> nested dicts solution, but it's horribly slow, since I use a regular dict(). And I'm having trouble redesigning it around a defaultdict in a way that makes filling the ComboBoxes straightforward.
My current code:
def list2tree(m):
    tmp = {}
    for i in range(len(m)):
        if m.count('.') == 0:
            return m
        a = m.split('.', 1)
        try:
            tmp[a[0]].append(list2tree(a[1]))
        except (KeyError, AttributeError):
            tmp[a[0]] = list2tree(a[1])
    return tmp

main_dict = {}
i = 0
for m in methods:
    main_dict = list2tree(m)
    i += 1
    if (i % 100) == 0: print i, len(methods)
print main_dict, i, len(methods)
ls = ['one.two.three.four', 'one.six.seven.eight', 'five.nine.ten', 'twelve.zero']

tree = {}
for item in ls:
    t = tree
    for part in item.split('.'):
        t = t.setdefault(part, {})
Result:
{
"twelve": {
"zero": {}
},
"five": {
"nine": {
"ten": {}
}
},
"one": {
"six": {
"seven": {
"eight": {}
}
},
"two": {
"three": {
"four": {}
}
}
}
}
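For the combo-box part of the question, the options for each subsequent box are simply the keys one level down in this tree (Python 3.7+ dicts keep insertion order, so the order matches the original list):

print(list(tree))         # ['one', 'five', 'twelve']  -> items for the 1st QComboBox
print(list(tree['one']))  # ['two', 'six']             -> 2nd QComboBox after picking 'one'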
While this is beyond the scope of the original question, some comments mentioned a form of this algorithm that incorporates values. I came up with this to that end:
def dictionaryafy(in_dict):
    tree = {}
    for key, value in in_dict.items():
        t = tree
        parts = key.split(".")
        for part in parts[:-1]:
            t = t.setdefault(part, {})
        t[parts[-1]] = value
    return tree
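A quick usage sketch of the function above:

print(dictionaryafy({"one.two.three": 1, "one.four": 2}))
# {'one': {'two': {'three': 1}, 'four': 2}}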