Create a data frame from a complex nested dictionary? - python

I have a large, deeply nested JSON file saved in .txt format. I need to access some specific key pairs and create a data frame or another transformed JSON object for further use. Here is a small sample with 2 key pairs.
[
    {
        "ko_id": [819752],
        "concepts": [
            {
                "id": ["11A71731B880:http://ontology.intranet.com/Taxonomy/116#en"],
                "uri": ["http://ontology.intranet.com/Taxonomy/116"],
                "language": ["en"],
                "prefLabel": ["Client coverage & relationship management"]
            }
        ]
    },
    {
        "ko_id": [819753],
        "concepts": [
            {
                "id": ["11A71731B880:http://ontology.intranet.com/Taxonomy/116#en"],
                "uri": ["http://ontology.intranet.com/Taxonomy/116"],
                "language": ["en"],
                "prefLabel": ["Client coverage & relationship management"]
            }
        ]
    }
]
The following code loads the data as a list, but I need to access the data (probably as a dictionary) and pull "ko_id", "uri" and "prefLabel" from each key pair into a pandas data frame or a dictionary for further analysis.
import json as js

with open('sample_data.txt') as data_file:
    json_sample = js.load(data_file)
The following code gives me the exact value of the first element, but I don't actually know how to put it together and build the complete logic to create the dataframe.
print(sample_dict["ko_id"][0])
print(sample_dict["concepts"][0]["prefLabel"][0])
print(sample_dict["concepts"][0]["uri"][0])

final_df = pd.DataFrame()
for record in sample_dict:
    df = pd.DataFrame(record['concepts'])
    df['ko_id'] = record['ko_id']
    final_df = pd.concat([final_df, df])  # DataFrame.append is removed in newer pandas; concat does the same job

You can pass the data to pandas.DataFrame using a generator:
import pandas as pd
import json as js

with open('sample_data.txt') as data_file:
    json_sample = js.load(data_file)

df = pd.DataFrame(data=((key["ko_id"][0],
                         key["concepts"][0]["prefLabel"][0],
                         key["concepts"][0]["uri"][0]) for key in json_sample),
                  columns=("ko_id", "prefLabel", "uri"))
Output:
>>> df
ko_id prefLabel uri
0 819752 Client coverage & relationship management http://ontology.intranet.com/Taxonomy/116
1 819753 Client coverage & relationship management http://ontology.intranet.com/Taxonomy/116
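Note that the generator above only looks at the first concept of each record. A plain loop (a sketch, assuming the same structure as the sample and that each field is wrapped in a one-element list) builds the same frame and also handles records with several concepts:
import json
import pandas as pd

with open('sample_data.txt') as data_file:
    json_sample = json.load(data_file)

rows = []
for record in json_sample:
    for concept in record["concepts"]:            # handles multiple concepts per record
        rows.append({
            "ko_id": record["ko_id"][0],          # each field is a one-element list
            "prefLabel": concept["prefLabel"][0],
            "uri": concept["uri"][0],
        })

df = pd.DataFrame(rows)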

Related

Pyspark - Dynamically create schema from json files

I'm using Spark on Databricks notebooks to ingest some data from an API call.
I start off by reading all the data from the API response into a dataframe called df. But I only need a few columns from the API response, not all of them, so
I store the required columns and their data types in a json file:
{
    "structure": [
        {
            "column_name": "column1",
            "column_type": "StringType()"
        },
        {
            "column_name": "column2",
            "column_type": "IntegerType()"
        },
        {
            "column_name": "column3",
            "column_type": "DateType()"
        },
        {
            "column_name": "column4",
            "column_type": "StringType()"
        }
    ]
}
And then I'm building the schema using the following code:
with open("/dbfs/mnt/datalake/Dims/shema_json","r") as read_handle:
file_contents = json.load(read_handle)
struct_fields = []
for column in file_contents.get("structure"):
struct_fields.append(f'StructField("{column.get("column_name")}",{column.get("column_type")},True)')
new_schema = StructType(struct_fields)
Then, finally, I want to create a dataframe with the required columns and the correct data types using this code:
df_staging = spark.createDataFrame(df.rdd,schema = new_schema)
But when I do this, I get an error message saying 'str' object has no attribute 'name'.
To get a subset of columns from a dataframe you can use a simple select combined with cast:
import pyspark.sql.types as T

# Build "cast(column as type)" SQL expressions from the structure file,
# e.g. {"column_name": "column2", "column_type": "IntegerType()"} -> "cast(column2 as int)"
cols = [f"cast({c['column_name']} as {getattr(T, c['column_type'].replace('()', ''))().simpleString()})"
        for c in file_contents['structure']]
df.selectExpr(*cols).show()
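As for the original error: new_schema was built from strings rather than StructField objects, which is what triggers 'str' object has no attribute 'name'. A minimal sketch of a fix, assuming the same schema file layout and a small hand-written mapping from the type names used in it:
import json
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DateType

# Map the type names stored in the json file to real pyspark type objects.
type_map = {
    "StringType()": StringType(),
    "IntegerType()": IntegerType(),
    "DateType()": DateType(),
}

with open("/dbfs/mnt/datalake/Dims/shema_json", "r") as read_handle:
    file_contents = json.load(read_handle)

struct_fields = [
    StructField(column["column_name"], type_map[column["column_type"]], True)
    for column in file_contents["structure"]
]
new_schema = StructType(struct_fields)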

Convert Embedded JSON Dict To Pandas DataFrame Where Column Headers Are Separate From Values

I'm trying to create a python pandas DataFrame out of a JSON dictionary. The embedding is tripping me up.
The column headers are in a different section of the JSON file from the values.
The json looks similar to the below. There is one section of column headers and multiple sections of data.
I need each column filled with the data that relates to it, so value_one in each case will fill the column under header_one, and so on.
I have come close, but can't seem to get it to produce the dataframe as described.
{
    "my_data": {
        "column_headers": [
            "header_one",
            "header_two",
            "header_three"
        ],
        "values": [
            {
                "data": [
                    "value_one",
                    "value_two",
                    "value_three"
                ]
            },
            {
                "data": [
                    "value_one",
                    "value_two",
                    "value_three"
                ]
            }
        ]
    }
}
Assuming your dictionary is my_dict, try:
>>> pd.DataFrame(data=[d["data"] for d in my_dict["my_data"]["values"]],
...              columns=my_dict["my_data"]["column_headers"])
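With the sample above, the result should look something like:
  header_one header_two header_three
0  value_one  value_two  value_three
1  value_one  value_two  value_three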

Deeply nested json - a list within a dictionary to Pandas DataFrame

I'm trying to parse nested json results.
data = {
    "results": [
        {
            "components": [
                {
                    "times": {
                        "periods": [
                            {
                                "fromDayOfWeek": 0,
                                "fromHour": 12,
                                "fromMinute": 0,
                                "toDayOfWeek": 4,
                                "toHour": 21,
                                "toMinute": 0,
                                "id": 156589,
                                "periodId": 20855
                            }
                        ],
                    }
                }
            ],
        }
    ],
}
I can get to and create dataframes for "results" and "components" lists, but cannot get to "periods" due to the "times" dict. So far I have this:
df = pd.json_normalize(data, record_path = ['results','components'])
Need a separate "periods" dataframe with the included column names and values. Would appreciate your help on this. Thank you!
The nesting goes four levels deep:
I    results
II   components
III  times
IIII periods
Using json_normalize should be the correct approach:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html
There are 4 levels of nesting. There can be x components in results and y times in components - although that degree of nesting may be over-engineering.
The simplest way of getting at the data is chained indexing:
print(data['a']['b']['c']['d'])
In your case (note that "results" and "components" are lists, so an index is needed):
print(data['results'][0]['components'][0]['times']['periods'])
You can collect a specific property from the periods level with a helper like this:
def GetPropertyFromPeriods(prop):
    propertyList = []
    for component in data['results'][0]['components']:
        for period in component['times']['periods']:
            propertyList.append(period[prop])
    return propertyList
This gives you access to one property inside periods (e.g. fromDayOfWeek, fromHour, fromMinute).
After converting the json values, transform them into a pandas dataframe:
print(pd.DataFrame(data, columns=["columnA", "columnB"]))
If stuck:
How to Create a table with data from JSON output in Python
Python - How to convert JSON File to Dataframe
pandas documentation:
pandas.DataFrame.from_dict
pandas.json_normalize
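Putting the json_normalize suggestion into practice, here is a sketch that reaches the "periods" level directly via record_path, assuming the data dict shown in the question:
import pandas as pd

# record_path walks results -> components -> times -> periods and builds one
# row per period, with the period keys as columns.
periods_df = pd.json_normalize(data, record_path=["results", "components", "times", "periods"])
print(periods_df)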

pandas change the order of columns

In my project I'm using Flask. I get JSON (via a REST API) with data that I need to convert to a pandas DataFrame.
The JSON looks like:
{
    "entity_data": [
        {"id": 1, "store": "a", "marker": "a"}
    ]
}
I get the JSON and extract the data:
params = request.json
entity_data = params.pop('entity_data')
and then I convert the data into a pandas dataframe:
entity_ids = pd.DataFrame(entity_data)
the result looks like this:
id marker store
0 1 a a
This is not the original order of the columns. I'd like the column order to match the order in the dictionary.
help?
Use OrderedDict for an ordered dictionary
You should not assume dictionaries are ordered. While dictionaries are insertion-ordered as of Python 3.7, whether libraries maintain that order when reading JSON into a dictionary, or when converting the dictionary to a pandas dataframe, should not be assumed.
The most reliable solution is to use collections.OrderedDict from the standard library:
import json
import pandas as pd
from collections import OrderedDict

params = """{
    "entity_data": [
        {"id": 1, "store": "a", "marker": "a"}
    ]
}"""

# replace the hard-coded params string with request.json data in your app
data = json.loads(params, object_pairs_hook=OrderedDict)
entity_data = data.pop('entity_data')
df = pd.DataFrame(entity_data)
print(df)
#    id store marker
# 0   1     a      a
Just add the column names parameter.
entity_ids = pd.DataFrame(entity_data, columns=["id","store","marker"])
Assuming you have access to the JSON sender, you can send the order in the JSON itself, like:
{
    "order": ["id", "store", "marker"],
    "entity_data": {
        "id": [1, 2],
        "store": ["a", "b"],
        "marker": ["a", "b"]
    }
}
Then create the DataFrame with the columns specified, as said by Chiheb.K:
import pandas as pd

params = request.json
entity_data = params.pop('entity_data')
order = params.pop('order')
entity_df = pd.DataFrame(entity_data, columns=order)
If you cannot explicitly specify the order in the JSON, see the answer above about specifying object_pairs_hook in JSONDecoder to get an OrderedDict, and then create the DataFrame.
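A minimal sketch of that fallback in the Flask context, assuming the raw JSON body is available via request.get_data():
import json
from collections import OrderedDict

import pandas as pd
from flask import request

# Parse the raw body ourselves so the key order from the JSON is preserved.
params = json.loads(request.get_data(as_text=True), object_pairs_hook=OrderedDict)
entity_data = params.pop('entity_data')
entity_ids = pd.DataFrame(entity_data)  # columns follow the JSON key order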

JSON to Pandas: is there a more elegant solution?

I have some JSON, returned from an API call, that looks something like this:
{
    "result": {
        "code": "OK",
        "msg": ""
    },
    "report_name": "FAMOUS_DICTATORS",
    "columns": [
        "rank",
        "name",
        "deaths"
    ],
    "data": [
        {
            "row": [
                1,
                "Mao Zedong",
                63000000
            ]
        },
        {
            "row": [
                2,
                "Jozef Stalin",
                23000000
            ]
        }
    ]
}
I'd like to convert the JSON into a Pandas DataFrame:
rank name deaths
1 Mao Zedong 63000000
2 Jozef Stalin 23000000
I wrote this and it works, but looks a bit ugly:
import pandas as pd
import json

columns = eval(r.content)['columns']
df = pd.DataFrame(columns=eval(r.content)['columns'])
for row in eval(r.content)['data']:
    df.loc[len(df) + 1] = row['row']
Is there a more elegant/Pythonic way to do this (e.g. possibly using pandas.io.json.read_json)?
The read_json function of pandas is a very tricky method to use. If you are not certain that your JSON is valid, or that its structure maps cleanly onto a dataframe, it is much better to stick to tried and tested methods and break your data down into something pandas can use without issues.
In your case, I suggest breaking the data down into a list of lists. Out of all that JSON, the only parts you really need are under the "data" and "columns" keys.
Try this:
import pandas as pd
import json

with open("test.json") as f:
    js = json.load(f)

rows = [row["row"] for row in js["data"]]  # Transform the 'row' entries into a list of lists.
df = pd.DataFrame(rows, columns=js["columns"])
print(df)
This gives me the desired result:
rank name deaths
0 1 Mao Zedong 63000000
1 2 Jozef Stalin 23000000
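A slightly leaner variant of the same idea, sketched on the assumption that r is the requests response from the question, so r.json() can replace the repeated eval(r.content) calls:
import pandas as pd

payload = r.json()  # parse the response body once instead of calling eval(r.content)
rows = [item["row"] for item in payload["data"]]
df = pd.DataFrame(rows, columns=payload["columns"])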
See pandas.io.json.read_json(path_or_buf=None, orient=None, typ='frame', dtype=True, convert_axes=True, convert_dates=True, keep_default_dates=True, numpy=False, precise_float=False, date_unit=None):
http://pandas.pydata.org/pandas-docs/dev/generated/pandas.io.json.read_json.html
