How to convert Nested JSON to Pandas Dataframe - python

I have the following nested json:
import requests
headers = {
'Content-Type': 'application/json'
}
r = requests.get("...API DATA...")
[
{
"ticker":"btcusd",
"baseCurrency":"btc",
"quoteCurrency":"usd",
"priceData":[
{
"open":3914.749407813885,
"high":3942.374263716895,
"low":3846.1755315352952,
"close":3849.1217299601617,
"date":"2019-01-02T00:00:00+00:00",
"tradesDone":756.0,
"volume":339.68131616889997,
"volumeNotional":1307474.735327181
}
]
}
]
I would like to convert it into Pandas data frame. I have been able to do this:
j = r.json()
df = pd.DataFrame.from_dict(j)
df
Output:
I would like to also expand 'priceData' in columns.
I have tried different methods, including json.normalise and json.loads, but there was always an error I could not understand.
Could please anyone show me how to do it in order to understand?
Thanks!
EDIT
priceData contains more than 1 element.
"ticker", "baseCurrency", and "quoteCurrency" are not essentially needed in the data frame, hence can be discarded.

Use json_normalize:
import pandas as pd
pd.json_normalize(data,'priceData',['ticker','baseCurrency','quoteCurrency'])
where,
data = [
{
"ticker":"btcusd",
"baseCurrency":"btc",
"quoteCurrency":"usd",
"priceData":[
{
"open":3914.749407813885,
"high":3942.374263716895,
"low":3846.1755315352952,
"close":3849.1217299601617,
"date":"2019-01-02T00:00:00+00:00",
"tradesDone":756.0,
"volume":339.68131616889997,
"volumeNotional":1307474.735327181
}
]
}
]
Edit: for multiple API calls
final_data = []
for _ in range(3): ## how ever many times you want to call the API
## get you data from each API call
final_data.extend(data)

Related

Pyspark - Dynamically create schema from json files

I'm using Spark on Databricks notebooks to ingest some data from API call.
I start off by reading all the data from API response into a dataframe called df. But, I only need to few columns from API response, not all of them and also
I store the required columns and their data types in a json file
{
"structure": [
{
"column_name": "column1",
"column_type": "StringType()"
},
{
"column_name": "column2",
"column_type": "IntegerType()"
},
{
"column_name": "column3",
"column_type": "DateType()"
},
{
"column_name": "column4",
"column_type": "StringType()"
}
]
}
And then I'm building the schema using following code
with open("/dbfs/mnt/datalake/Dims/shema_json","r") as read_handle:
file_contents = json.load(read_handle)
struct_fields = []
for column in file_contents.get("structure"):
struct_fields.append(f'StructField("{column.get("column_name")}",{column.get("column_type")},True)')
new_schema = StructType(struct_fields)
Then finally, I want to create a dataframe with required columns with correct data types using this code
df_staging = spark.createDataFrame(df.rdd,schema = new_schema)
But, when I do this, I get an error message saying 'str' object has no attribute 'name'
To get a subset of columns from a dataframe you can use a simple select combined with cast:
import importlib
cols=[f"cast({c['column_name']} as {getattr(importlib.import_module('pyspark.sql.types'), c['column_type'].replace('()',''))().simpleString()})" for c in file_contents['structure']]
df.selectExpr(*cols).show()

Convert Embedded JSON Dict To Panda DataFrame Where Columns Headers Are Seperate From Values

I'm trying to create a python pandas DataFrame out of a JSON dictionary. The embedding is tripping me up.
The column headers are in a different section of the JSON file to the values.
The json looks similar to below. There is one section of column headers and multiple sections of data.
I need each column filled with the data that relates to it. So value_one in each case will fill the column under header_one and so on.
I have come close, but can't seem to get it to spit out the dataframe as described.
{
"my_data": {
"column_headers": [
"header_one",
"header_two",
"header_three"
],
"values": [
{
"data": [
"value_one",
"value_two",
"value_three"
]
},
{
"data": [
"value_one",
"value_two",
"value_three"
]
}
]
}
}
Assuming your dictionary is my_dict, try:
>>> pd.DataFrame(data=[d["data"] for d in my_dict["my_data"]["values"]],
columns=my_dict["my_data"]["column_headers"])

Deeply nested json - a list within a dictionary to Pandas DataFrame

I'm trying to parse nested json results.
data = {
"results": [
{
"components": [
{
"times": {
"periods": [
{
"fromDayOfWeek": 0,
"fromHour": 12,
"fromMinute": 0,
"toDayOfWeek": 4,
"toHour": 21,
"toMinute": 0,
"id": 156589,
"periodId": 20855
}
],
}
}
],
}
],
}
I can get to and create dataframes for "results" and "components" lists, but cannot get to "periods" due to the "times" dict. So far I have this:
df = pd.json_normalize(data, record_path = ['results','components'])
Need a separate "periods" dataframe with the included column names and values. Would appreciate your help on this. Thank you!
I results
II components
III times
IIII periods
The normalize should be correct way:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html
There is 4 level of nesting. There can be x components in results and y times in components - however that type of nesting is overengineering?
The simplest way of getting data is:
print data['a']['b']['c']['d'] (...)
in your case:
print data['results']['components']['times']['periods']
You can access the specific nested level by this piece of code:
def GetPropertyFromPeriods (property):
propertyList = []
for x in data['results']['components']['times']:
singleProperty = photoURL['periods'][property]
propertyList.append(singleProperty)
return propertyList
This give you access to one property inside periods (fromDayOfWeek, fromHour, fromMinute)
After coverting json value, transform it into pandas dataframe:
print pd.DataFrame(data, columns=["columnA", "columnB”])
If stuck:
How to Create a table with data from JSON output in Python
Python - How to convert JSON File to Dataframe
pandas documentation:
pandas.DataFrame.from_dict
pandas.json_normalize

Create a data frame from a complex nested dictionary?

I have a big nested, then nested then nested json file saved as .txt format. I need to access some specific key pairs and crate a data frame or another transformed json object for further use. Here is a small sample with 2 key pairs.
[
{
"ko_id": [819752],
"concepts": [
{
"id": ["11A71731B880:http://ontology.intranet.com/Taxonomy/116#en"],
"uri": ["http://ontology.intranet.com/Taxonomy/116"],
"language": ["en"],
"prefLabel": ["Client coverage & relationship management"]
}
]
},
{
"ko_id": [819753],
"concepts": [
{
"id": ["11A71731B880:http://ontology.intranet.com/Taxonomy/116#en"],
"uri": ["http://ontology.intranet.com/Taxonomy/116"],
"language": ["en"],
"prefLabel": ["Client coverage & relationship management"]
}
]
}
]
The following code load the data as list but I need to access to the data probably as a dictionary and I need the "ko_id", "uri" and "prefLabel" from each key pair and put it to a pandas data frame or a dictionary for further analysis.
with open('sample_data.txt') as data_file:
json_sample = js.load(data_file)
The following code gives me the exact value of the first element. But donot actually know how to put it together and build the ultimate algorithm to create the dataframe.
print(sample_dict["ko_id"][0])
print(sample_dict["concepts"][0]["prefLabel"][0])
print(sample_dict["concepts"][0]["uri"][0])
for record in sample_dict:
df = pd.DataFrame(record['concepts'])
df['ko_id'] = record['ko_id']
final_df = final_df.append(df)
You can pass the data to pandas.DataFrame using a generator:
import pandas as pd
import json as js
with open('sample_data.txt') as data_file:
json_sample = js.load(data_file)
df = pd.DataFrame(data = ((key["ko_id"][0],
key["concepts"][0]["prefLabel"][0],
key["concepts"][0]["uri"][0]) for key in json_sample),
columns = ("ko_id", "prefLabel", "uri"))
Output:
>>> df
ko_id prefLabel uri
0 819752 Client coverage & relationship management http://ontology.intranet.com/Taxonomy/116
1 819753 Client coverage & relationship management http://ontology.intranet.com/Taxonomy/116

JSON to Pandas: is there a more elegant solution?

I have some JSON, returned from an API call, that looks something like this:
{
"result": {
"code": "OK",
"msg": ""
},
"report_name": "FAMOUS_DICTATORS",
"columns": [
"rank",
"name",
"deaths"
],
"data": [
{
"row": [
1,
"Mao Zedong",
63000000
]
},
{
"row": [
2,
"Jozef Stalin",
23000000
]
}
]
}
I'd like to convert the JSON into a Pandas DataFrame:
rank name deaths
1 Mao Zedong 63000000
2 Jozef Stalin 23000000
I wrote this and it works, but looks a bit ugly:
import pandas as pd
import json
columns = eval(r.content)['columns']
df = pd.DataFrame(columns = eval(r.content)['columns'])
for row in eval(r.content)['data']:
df.loc[len(df)+1] = row['row']
Is there a more elegant/Pythonic way to do this (e.g. possibly using pandas.io.json.read_json)?
The read_json function of pandas is a very tricky method to use. If you don't know with certainty the validity of your JSON object or whether its initial structure is sane enough to build a dataframe around, it's much better to stick to tried and tested methods to break your data down to something that pandas can use without issues 100%.
In your case, I suggest breaking down your data to a list of lists. Out of all that JSON, the only part you really need is in the data and column keys.
Try this:
import pandas as pd
import json
import urllib
js = json.loads(urllib.urlopen("test.json").read())
data = js["data"]
rows = [row["row"] for row in data] # Transform the 'row' keys to list of lists.
df = pd.DataFrame(rows, columns=js["columns"])
print df
This gives me the desired result:
rank name deaths
0 1 Mao Zedong 63000000
1 2 Jozef Stalin 23000000
see pandas.io.json.read_json(path_or_buf=None, orient=None, typ='frame', dtype=True, convert_axes=True, convert_dates=True, keep_default_dates=True, numpy=False, precise_float=False, date_unit=None
http://pandas.pydata.org/pandas-docs/dev/generated/pandas.io.json.read_json.html

Categories

Resources