Unpacking nested dictionary list within Python

I would be very grateful if someone could suggest a more Pythonic way of handling the following issue:
Problem:
I have a JSON object parsed into a Python object (dict). The issue is that the JSON object's structure is a list of dictionaries (dict1), and each of these dictionaries contains a further dictionary (dict2).
I would like to parse all the content of dict1 and combine the contents of dict2 within dict1.
Thereafter, I would like to parse this into pandas.
json_object = {
    "data": [
        {
            "complete": "true",
            "data_two": {
                "a": "5",
                "b": "6",
                "c": "6",
                "d": "8"
            },
            "time": "2016-10-17",
            "End_number": 2
        },
        {
            "complete": "true",
            "data_two": {
                "a": "11",
                "b": "21",
                "c": "31",
                "d": "41"
            },
            "time": "2016-10-17",
            "End_number": 1
        }
    ],
    "Location": "DE",
    "End Zone": 5
}
My attempt:
dataList = json_object['data']
Unpacked_Data = [(d['time'], d['End_number'], d['data_two'].keys(), d['data_two'].values())
                 for d in dataList]
Unpacked_Data is a list of tuples that now contains (time, end_number, [List of keys], [list of values])
To use this in a Pandas dataframe I would then need to unpack the two lists within my tuple. --> is there an easy way to unpack lists within a tuple?
Is there a better and more elegant/Pythonic way of approaching this problem?
Thanks,
12avi

One way (using pandas) is to start by putting everything into a dataframe, then apply pd.Series to the packed columns and concatenate the pieces side by side (axis=1):
df = pd.DataFrame(Unpacked_Data)
unpacked0 = df[2].apply(lambda x: pd.Series(list(x)))
unpacked1 = df[3].apply(lambda x: pd.Series(list(x)))
pd.concat((df[[0, 1]], unpacked0, unpacked1), axis=1)
The other way is to use list comprehension and argument unpacking:
df = pd.DataFrame([[a, b, *c, *d] for a, b, c, d in Unpacked_Data])
However, the second method may not line up the way you want if the packed lists are not all the same length.
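For completeness: on a recent pandas (1.0+ exposes the function as pd.json_normalize; older versions have it as pd.io.json.json_normalize), the whole flattening can be done in one call. A minimal sketch, assuming the json_object from the question:
import pandas as pd

# nested "data_two" keys become columns like "data_two.a", "data_two.b", ...
df = pd.json_normalize(json_object['data'])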

Related

Using pandas.json_normalize to "unfold" a dictionary of a list of dictionaries

I am new to Python (and coding in general) so I'll do my best to explain the challenge I'm trying to work through.
I'm working with a large dataset that was exported as a CSV from a database. However, one column within this CSV export contains a nested list of dictionaries (as best as I can tell). I've looked around extensively online, including on Stack Overflow, but haven't quite found a full solution. I think I understand conceptually what I'm trying to accomplish, but I'm not clear on the best method or data-prepping process to use.
Here is an example of the data (pared down to just the two columns I'm interested in):
{
    "app_ID": {
        "0": "1abe23574",
        "1": "4gbn21096"
    },
    "locations": {
        "0": "[ {"loc_id" : "abc1", "lat" : "12.3456", "long" : "101.9876" },
               {"loc_id" : "abc2", "lat" : "45.7890", "long" : "102.6543"} ]",
        "1": "[ ]"
    }
}
Basically each app_ID can have multiple locations tied to a single ID, or it can be empty, as seen above. I have attempted to follow some guides I found online using pandas' json_normalize() function to "unfold" the list of dictionaries into their own rows in a pandas dataframe.
I'd like to end up with something like this:
loc_id  lat      long      app_ID
abc1    12.3456  101.9876  1abe23574
abc2    45.7890  102.6543  1abe23574
etc...
I am learning about how to use the different functions of json_normalize, like "record_path" and "meta", but haven't been able to get it to work yet.
I have tried loading the json file into a Jupyter Notebook using:
with open('location_json.json', 'r') as f:
    data = json.loads(f.read())

df = pd.json_normalize(data, record_path=['locations'])
but it only creates a dataframe that is 1 row and multiple columns long, where I'd like to have multiple rows generated from the inner-most dictionary that tie back to the app_ID and loc_ID fields.
Attempt at a solution:
I was able to get close to the dataframe format I wanted using:
with open('location_json.json', 'r') as f:
    data = json.loads(f.read())

df = pd.json_normalize(data['locations']['0'])
but that would then require some kind of iteration through the list in order to create a dataframe, and then I'd lose the connection to the app_ID fields. (As best as I can understand how the json_normalize function works).
Am I on the right track trying to use json_normalize, or should I start over again and try a different route? Any advice or guidance would be greatly appreciated.
I can't say that suggesting the convtools library to a beginner is a good thing, because this library is almost like another Python on top of Python. It helps to dynamically define data conversions (generating Python code under the hood).
But anyway, here is the code if I understood the input data right:
import json

from convtools import conversion as c

data = {
    "app_ID": {"0": "1abe23574", "1": "4gbn21096"},
    "locations": {
        "0": """[ {"loc_id" : "abc1", "lat" : "12.3456", "long" : "101.9876" },
                  {"loc_id" : "abc2", "lat" : "45.7890", "long" : "102.6543"} ]""",
        "1": "[ ]",
    },
}

# define it once and use multiple times
converter = (
    c.join(
        # converts "app_ID" data to an iterable of dicts
        (
            c.item("app_ID")
            .call_method("items")
            .iter({"id": c.item(0), "app_id": c.item(1)})
        ),
        # converts "locations" data to an iterable of dicts,
        # where each id like "0" is zipped to each location.
        # the result is an iterable of dicts like {"id": "0", "loc": {"loc_id": ...}}
        (
            c.item("locations")
            .call_method("items")
            .iter(
                c.zip(id=c.repeat(c.item(0)), loc=c.item(1).pipe(json.loads))
            )
            .flatten()
        ),
        # join on "id"
        c.LEFT.item("id") == c.RIGHT.item("id"),
        how="full",
    )
    # process results, where index 0 is the LEFT item and index 1 is the RIGHT one
    .iter(
        {
            "loc_id": c.item(1, "loc", "loc_id", default=None),
            "lat": c.item(1, "loc", "lat", default=None),
            "long": c.item(1, "loc", "long", default=None),
            "app_id": c.item(0, "app_id"),
        }
    )
    .as_type(list)
    .gen_converter()
)

result = converter(data)
assert result == [
    {'loc_id': 'abc1', 'lat': '12.3456', 'long': '101.9876', 'app_id': '1abe23574'},
    {'loc_id': 'abc2', 'lat': '45.7890', 'long': '102.6543', 'app_id': '1abe23574'},
    {'loc_id': None, 'lat': None, 'long': None, 'app_id': '4gbn21096'},
]
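If convtools feels like too much machinery, a plain stdlib-plus-pandas version is possible too. A sketch, assuming the cleaned-up data dict from above (each locations entry is a JSON-encoded list keyed the same way as app_ID):
import json
import pandas as pd

rows = []
for idx, app_id in data["app_ID"].items():
    locs = json.loads(data["locations"].get(idx, "[]"))
    # `or [{}]` keeps apps with no locations as a row of NaNs plus the
    # app_ID, mirroring the full join above
    for loc in locs or [{}]:
        rows.append({**loc, "app_ID": app_id})

df = pd.DataFrame(rows)  # columns: loc_id, lat, long, app_ID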

Multi-level Python Dict to Pandas DataFrame only processes one level out of many

I'm parsing some XML data, doing some logic on it, and trying to display the results in an HTML table. The dictionary, after filling, looks like this:
{
    "general_info": {
        "name": "xxx",
        "description": "xxx",
        "language": "xxx",
        "prefix": "xxx",
        "version": "xxx"
    },
    "element_count": {
        "folders": 23,
        "conditions": 72,
        "listeners": 1,
        "outputs": 47
    },
    "external_resource_count": {
        "total": 9,
        "extensions": {
            "jar": 8,
            "json": 1
        },
        "paths": {
            "/lib": 9
        }
    },
    "complexity": {
        "over_1_transition": {
            "number": 4,
            "percentage": 30.769
        },
        "over_1_trigger": {
            "number": 2,
            "percentage": 15.385
        },
        "over_1_output": {
            "number": 4,
            "percentage": 30.769
        }
    }
}
Then I'm using pandas to convert the dictionary into a table, like so:
data_frame = pandas.DataFrame.from_dict(data=extracted_metrics, orient='index').stack().to_frame()
The result is a table that is mostly correct.
While the first and second levels seem to render correctly, those categories with a sub-sub category get written as a string in the cell, rather than as a further column. I've also tried using stack(level=1) but it raises an error "IndexError: Too many levels: Index has only 1 level, not 2". I've also tried making it into a series with no luck. It seems like it only renders "complete" columns. Is there a way of filling up the empty spaces in the dictionary before processing?
How can I get, for example, external_resource_count -> extensions to have two daughter rows jar and json, with an additional column for the values?
Extra credit if anyone can tell me how to get rid of the first row with the index numbers. Thanks!
The way you load the dataframe is correct, but you should rename the 0 column to a proper column name.
# this function extracts all the keys from your nested dicts
def explode_and_filter(df, filterdict):
    return [df[col].apply(lambda x: x.get(k) if type(x) == dict else x).rename(f'{k}')
            for col, nested in filterdict.items()
            for k in nested]

data_frame = pd.DataFrame.from_dict(data=extracted_metrics, orient='index').stack().to_frame(name='somecol')

# let's separate the rows where a dict is present & explode only those rows
mask = data_frame.somecol.apply(lambda x: type(x) == dict)
expp = explode_and_filter(data_frame[mask],
                          {'somecol': ['jar', 'json', '/lib', 'number', 'percentage']})

# here we concat the exploded series to a frame
exploded_df = (pd.concat(expp, axis=1).stack().to_frame(name='somecol2')
               .reset_index(level=2).rename(columns={'level_2': 'somecol'}))

# and now we concat the rows with dict elements with the rows with non-dict elements
out = pd.concat([data_frame[~mask], exploded_df])
The output dataframe then combines the scalar rows with the exploded nested rows.
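As an aside, if the goal is simply one row per leaf key rather than the exact layout above, pd.json_normalize can flatten arbitrarily deep dicts in one call. A sketch, not a drop-in replacement for the solution above:
# one row with one column per leaf path; transposing gives one row per leaf
flat = pd.json_normalize(extracted_metrics, sep=' -> ').T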

How to clear an array and reset the values in a for loop while building a Json string?

I am looping through each row in an Excel sheet using the openpyxl import, ultimately to build a large JSON string that I can feed to an API.
While looping through each row and building out my JSON structure, I need to split a cell value by " || " and add each resulting value to a nested array inside a JSON section. With my current code, the problem is that I build my list object in the for loop and append the JSON chunk to a larger array, and it keeps appending my list values during each loop. So I used the .clear() method on the list to clear it after each loop... but then when I compile my final output my list is empty. It's like the list does not maintain its values once it has been added each loop. The tags section of the JSON ends up empty in the final output for every JSON line, when it should have the values for each unique iteration. It's almost like each loop needs its own unique array to use and keep the values. I am new to Python and gave it a good whirl; any suggestions in the right direction would be appreciated.
My data set (I have 3 rows in Excel): you can see that I have values I want to split in the 7th column. That is the column I am looping through to split the values, as they will be nested in my JSON.
Row 1 (cells) = "ABC","Testing","Testing Again","DATE","DATE",Empty,"A || B || C".
Row 2 (cells) = "ABC 2","Testing 2","Testing Again 2","DATE","DATE",Empty,"X || Y || Z".
Row 3 (cells) = "ABC 3","Testing 3","Testing Again 3","DATE","DATE",Empty,Empty.
My Code.
#from openpyxl import Workbook
import json
from openpyxl import load_workbook

output_table = input_table.copy()
var_path_excel_file = flow_variables['Location']
workbook = load_workbook(filename=var_path_excel_file)
sheet = workbook.active

#create a null value to be used
emptyString = "Null"

#list out all of the sections of the json that we want to print out - these are based on the inputs
jsonFull = []
jsondata = {}
tags = []

for value in sheet.iter_rows(min_row=2, min_col=0, max_col=40, values_only=True):
    #I add my split values to an array so that way when I add the array to the json
    #it will have the proper brackets I need for the API to run correctly
    if value[6] is not None:
        data = value[6].split(" || ")
        for temp in data:
            tags.append(temp)

    #I build out the json structure here that will be added for each excel row basically
    jsondata = {
        "name": value[0],
        "short_description": value[1],
        "long_description": value[2],
        "effective_start_date": value[3],
        "effective_end_date": value[4],
        "workflow_state": emptyString,
        "tags": tags
    }

    #Add the jsondata row to the larger collection
    jsonFull.append(jsondata)
    tags.clear()

print(json.dumps(jsonFull))
And then my desired outcome would be something like this. I just need to figure out the proper syntax for the list handling...and can't seem to find an example to base off of.
[
    {
        "name": "ABC",
        "short_description": "Testing",
        "long_description": "Testing Again",
        "effective_start_date": "2020-03-04T14:45:22Z",
        "effective_end_date": "2020-03-04T14:45:22Z",
        "workflow_state": "Null",
        "tags": [
            "A",
            "B",
            "C"
        ]
    },
    {
        "name": "ABC 2",
        "short_description": "Testing 2",
        "long_description": "Testing Again 2",
        "effective_start_date": "2020-03-04T14:45:22Z",
        "effective_end_date": "2020-03-04T14:45:22Z",
        "workflow_state": "Null",
        "tags": [
            "X",
            "Y",
            "Z"
        ]
    },
    {
        "name": "ABC 3",
        "short_description": "Testing 3",
        "long_description": "Testing Again 3",
        "effective_start_date": "2020-03-04T14:45:22Z",
        "effective_end_date": "2020-03-04T14:45:22Z",
        "workflow_state": "Null",
        "tags": []
    }
]
You're not making a copy of tags when you put it into the dictionary; you're storing a reference to the same list, so tags.clear() also empties the "tags" entry of every dict you've already appended. You need to create a new list at the beginning of each loop iteration, not reuse the same one.
for value in sheet.iter_rows(min_row=2, min_col=0, max_col=40, values_only=True):
    #I add my split values to an array so that way when I add the array to the json
    #it will have the proper brackets I need for the API to run correctly
    if value[6] is not None:
        tags = value[6].split(" || ")
    else:
        tags = []

    #I build out the json structure here that will be added for each excel row basically
    jsondata = {
        "name": value[0],
        "short_description": value[1],
        "long_description": value[2],
        "effective_start_date": value[3],
        "effective_end_date": value[4],
        "workflow_state": emptyString,
        "tags": tags
    }

    #Add the jsondata row to the larger collection
    jsonFull.append(jsondata)
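An equivalent fix, if you would rather keep the single tags list and the clear() call, is to store a shallow copy in the dict so that clearing the original can no longer empty what you already appended. A sketch of just the changed lines inside the loop:
jsondata["tags"] = list(tags)  # shallow copy; tags.clear() won't touch it
jsonFull.append(jsondata)
tags.clear()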

List key values for Json data file

I have a very long JSON file that I need to make sense of in order to query the data I'm interested in. To do this, I would like to extract all of the key values so I know what is available to query. Is there a quick way of doing this, or should I just write a parser that traverses the JSON file and extracts anything between either { and : or , and :?
given the example:
[{"Name": "key1", "Value": "value1"}, {"Name": "key2", "Value": "value2"}]
I am looking for the values:
"Name"
"Value"
That will depend on whether there's any nesting. But the basic pattern is something like this:
import json

with open("foo.json", "r") as fh:
    data = json.load(fh)

all_keys = set()
for datum in data:
    keys = set(datum.keys())
    all_keys.update(keys)
This:
# avoid naming the variable "dict"; that shadows the builtin
records = [{"Name": "key1", "Value": "value1"}, {"Name": "key2", "Value": "value2"}]
for val in records:
    print(val.keys())
gives you:
dict_keys(['Name', 'Value'])
dict_keys(['Name', 'Value'])
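If the file turns out to be nested, a recursive walk is one way to collect every key at every depth. A sketch, assuming the top level is a list or dict as in the example:
import json

def collect_keys(obj, found=None):
    # walk dicts and lists recursively, recording every dict key seen
    if found is None:
        found = set()
    if isinstance(obj, dict):
        for key, value in obj.items():
            found.add(key)
            collect_keys(value, found)
    elif isinstance(obj, list):
        for item in obj:
            collect_keys(item, found)
    return found

with open("foo.json", "r") as fh:
    print(collect_keys(json.load(fh)))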

Original dict/json from a pd.io.json.json_normalize() dataframe row

I have a pandas dataframe whose rows were created from dicts using pd.io.json.json_normalize(). The values (not the keys/column names) in the dataframe have been modified. From a row of the dataframe, I want to retrieve a dict with the same nested format the original dict had.
sample = {
    "A": {
        "a": 7
    },
    "B": {
        "a": "name",
        "z": {
            "dD": 20,
            "f_f": 3,
        }
    }
}
df = pd.io.json.json_normalize(sample, sep='__')
as expected df.columns returns me:
Index(['A__a', 'B__a', 'B__z__dD', 'B__z__f_f'], dtype='object')
I want to "reverse" the process now.
I can guarantee that no string in the original dict (key or value) has '__' as a substring, and that none starts or ends with '_'.
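One way to reverse the flattening, given the guarantees above, is to split each column name on the separator and rebuild the nesting level by level. A sketch (row_to_nested is a hypothetical helper, not a pandas function):
def row_to_nested(row, sep='__'):
    # rebuild {'A': {'a': ...}, 'B': {'z': {...}}} from {'A__a': ..., 'B__z__dD': ...}
    out = {}
    for col, value in row.items():
        *parents, leaf = col.split(sep)
        d = out
        for key in parents:
            d = d.setdefault(key, {})
        d[leaf] = value
    return out

restored = row_to_nested(df.iloc[0].to_dict())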
