I am working with JSON files that store thousands of entries or more.
First, I want to understand the data I am working with.
import json
with open("/home/xu/stock_data/stock_market_data/nasdaq/json/AAL.json", "r") as f:
    data = json.load(f)
print(json.dumps(data, indent=4))
This gives me an easy-to-read format, but some of the "keys" (I am not familiar with the JSON terminology, so I use the word "key" as for dict objects) have thousands of values, which makes the output hard to read as a whole.
I also tried:
import json
import pandas as pd
with open("/home/xu/stock_data/stock_market_data/nasdaq/json/AAL.json", "r") as f:
    data = json.load(f)
df = pd.DataFrame.from_dict(data, orient="index")
print(df.info)  # prints the bound method itself; df.info() would call it
but got
<bound method DataFrame.info of result error
chart [{'meta': {'currency': 'USD', 'symbol': 'AAL',... None>
This result hints at the structure, but it is truncated with "..." and does not show the whole picture.
My questions:
Is there something that works like np.array.shape for JSON/dict/pandas objects and shows the shape of the structure?
Is there a better library for inspecting the structure of a JSON file?
Edit:
Sorry, perhaps my wording of the problem was misleading. I tried pprint, and it gave me:
{ 'chart': { 'error': None,
'result': [ { 'events': { 'dividends': { '1406813400': { 'amount': 0.1,
'date': 1406813400},
'1414675800': { 'amount': 0.1,
'date': 1414675800},
'1423146600': { 'amount': 0.1,
'date': 1423146600},
'1430400600': { 'amount': 0.1,
'date': 1430400600},
'1438867800': { 'amount': 0.1,
'date': 1438867800},
'1446561000': { 'amount': 0.1,
'date': 1446561000},
'1454941800': { 'amount': 0.1,
'date': 1454941800},
'1462195800': { 'amount': 0.1,
'date': 1462195800},
'1470231000': { 'amount': 0.1,
'date': 1470231000},
'1478179800': { 'amount': 0.1,
'date': 1478179800},
'1486650600': { 'amount': 0.1,
'date': 1486650600},
'1494595800': { 'amount': 0.1,
'date': 1494595800},
'1502371800': { 'amount': 0.1,
'date': 1502371800},
'1510324200': { 'amount': 0.1,
'date': 1510324200},
'1517841000': { 'amount': 0.1,
'date': 1517841000},
'1525699800': { 'amount': 0.1,
'date': 1525699800},
'1533562200': { 'amount': 0.1,
'date': 1533562200},
'1541428200': { 'amount': 0.1,
'date': 1541428200},
'1549377000': { 'amount': 0.1,
'date': 1549377000},
'1557235800': { 'amount': 0.1,
'date': 1557235800},
'1565098200': { 'amount': 0.1,
'date': 1565098200},
'1572964200': { 'amount': 0.1,
'date': 1572964200},
'1580826600': { 'amount': 0.1,
'date': 1580826600}}},
'indicators': { 'adjclose': [ { 'adjclose': [ 18.19490623474121,
19.326200485229492,
19.05280113220215,
19.80699920654297,
20.268939971923828,
20.891149520874023,
20.928863525390625,
21.28710174560547,
20.88172149658203,
20.93828773498535,
20.721458435058594,
20.514055252075195,
20.466917037963867,
20.994853973388672,
20.81572914123535,
20.2595157623291,
20.155811309814453,
19.816425323486328,
20.702600479125977,
21.032560348510742,
20.740314483642578,
21.0419864654541,
21.26824951171875,
22.531522750854492,
23.266857147216797,
23.587390899658203,
25.9725284576416,
26.27420997619629,
27.150955200195312,
27.273509979248047,
27.7448787689209,
29.507808685302734,
30.92192840576172,
31.4404239654541,
31.817523956298828,
31.940074920654297,
31.676118850708008,
32.354888916015625,
31.157604217529297,
30.158300399780273,
30.63909339904785,
31.148174285888672,
30.969064712524414,
31.496990203857422,
31.01619529724121,
31.666685104370117,
32.31717300415039,
32.31717300415039,
30.497684478759766,
31.69496726989746,
32.006072998046875,
31.7326717376709,
31.940074920654297,
31.826950073242188,
31.346155166625977,
31.61954689025879,
...
...
...
# This goes on and on for each "key" of the JSON file, which means I have to scroll down thousands of lines to find out what type of data I have.
What I am hoping to find is a solution that outputs something like the following: it does not show the data in full, only the "keys" and maybe some additional information. Some files may contain many GBs of data, which makes scrolling through them impractical.
# This is what I am hoping to achieve.
{
"Name": {
"title": <datatype=str,len=20>,
"time_stamp":<data_type=list, len=3000>,
"closing_price":<data_type=list, len=3000>,
"high_price_of_the_day":<data_type=list, len=3000>
...
...
...
}
}
You have a few options for navigating this. If you want to render your data so you can inspect it quickly, there are built-in libraries for pretty-printing dictionaries (see pprint), but personally I recommend something that works out of the box without much configuration. I found pprintpp to be a good choice for any Python data structure. https://pypi.org/project/pprintpp/
Simply run in your terminal:
pip3 install pprintpp
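On Windows, the package typically installs under C:\Users\User\AppData\Local\Programs\Python\PythonXX\Lib\site-packages\pprintpp; on Linux or in a virtual environment the location will differ (pip3 show pprintpp prints it).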
After that, simply do this in your code:
import json
from pprintpp import pprint
with open("/home/xu/stock_data/stock_market_data/nasdaq/json/AAL.json", "r") as f:
    data = json.load(f)
pprint(data)
You can also call pprint(data, width=1) to guarantee that each dictionary key goes on its own line, even if the key is short. For example:
some_dict = {'a': 'b', 'c': {'aa': 'bb'}}
pprint(some_dict, width=1)
Outputs:
{
'a': 'b',
'c': {
'aa': 'bb',
},
}
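For the output the question actually asks for (keys plus type and length, without the data itself), a small recursive summarizer may be closer to the goal than any pretty-printer. The following is a minimal sketch, not a library API: `describe` and its `max_keys` parameter are hypothetical names, the sample data is a small stand-in for the real file, and it assumes lists are homogeneous and that the parsed file fits in memory.

```python
import json


def describe(obj, max_keys=5):
    """Return a nested summary of a parsed JSON value: keys are kept,
    leaf values are replaced by a short type/length description."""
    if isinstance(obj, dict):
        keys = list(obj)
        summary = {k: describe(obj[k], max_keys) for k in keys[:max_keys]}
        if len(keys) > max_keys:
            # Collapse the remaining keys into a single marker entry.
            summary["..."] = f"<{len(keys) - max_keys} more keys>"
        return summary
    if isinstance(obj, list):
        if obj and isinstance(obj[0], (dict, list)):
            # Assumption: lists are homogeneous, so describe only element 0.
            return {f"<list, len={len(obj)}>": describe(obj[0], max_keys)}
        inner = type(obj[0]).__name__ if obj else "empty"
        return f"<list of {inner}, len={len(obj)}>"
    if isinstance(obj, str):
        return f"<str, len={len(obj)}>"
    return f"<{type(obj).__name__}>"


# Small stand-in for the real file; the values are illustrative.
sample = {
    "chart": {
        "error": None,
        "result": [
            {
                "meta": {"currency": "USD", "symbol": "AAL"},
                "timestamp": [1406813400, 1414675800, 1423146600],
                "indicators": {"adjclose": [{"adjclose": [18.19, 19.32, 19.05]}]},
            }
        ],
    }
}

print(json.dumps(describe(sample), indent=4))
```

With the real file, `describe(data)` would replace `describe(sample)`. The `max_keys` cap keeps dicts with thousands of timestamp keys (such as the dividends above) from flooding the output. For files of many GBs, `json.load` itself becomes the bottleneck and a streaming parser would be needed instead.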
Hope this helped! Cheers :)
Related
I have JSON in this format:
{
"projects":[
{
"author":{
"id":163,
"name":"MyApp",
"easy_external_id":null
},
"sum_time_entries":0,
"sum_estimated_hours":29,
"currency":"EUR",
"custom_fields":[
{
"id":42,
"name":"System",
"internal_name":null,
"field_format":"string",
"value":null
},
{
"id":40,
"name":"Short describe",
"internal_name":null,
"field_format":"string",
"value":""
}
]
}
]"total_count":1772,
"offset":0,
"limit":1
}
I don't know how to convert this JSON "completely" to a dataframe. More precisely, I only want what is in projects. But when I do this:
df = pd.DataFrame(data['projects'])
I do get a dataframe built from projects only, but some columns (for example author or custom_fields) still hold nested objects, and I would like to expand those into separate columns as well.
Can anyone advise?
I expect:
author.id  author.name  author.easy_external_id  sum_time_entries  currency  custom_fields.id  custom_fields.name  etc.
163        MyApp        null                     0                 EUR       42                System              ...
Try:
df = pd.json_normalize(data['projects'])
See documentation here.
I tried it here and it works. I think the problem is in your JSON file. Try doing:
data = {'projects': [{'author': {'id': 163,
'name': 'MyApp',
'easy_external_id': None},
'sum_time_entries': 0,
'sum_estimated_hours': 29,
'currency': 'EUR',
'custom_fields': [{'id': 42,
'name': 'System',
'internal_name': None,
'field_format': 'string',
'value': None},
{'id': 40,
'name': 'Short describe',
'internal_name': None,
'field_format': 'string',
'value': ''}]}],
'total_count': 1772,
'offset': 0,
'limit': 1}
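To make the behaviour concrete: json_normalize turns nested dicts such as author into dotted column names, but a list such as custom_fields stays in a single column unless it is expanded separately. The sketch below mimics only the dict-flattening part in plain Python so the naming rule is visible; `flatten` is a hypothetical helper written for illustration, not pandas' implementation.

```python
def flatten(record, parent="", sep="."):
    """Flatten nested dicts into one level with dotted keys.
    Lists are left untouched and stored as a single value."""
    flat = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            # Recurse, carrying the dotted prefix down one level.
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


project = {
    "author": {"id": 163, "name": "MyApp", "easy_external_id": None},
    "sum_time_entries": 0,
    "currency": "EUR",
    "custom_fields": [{"id": 42, "name": "System"}],
}

row = flatten(project)
print(list(row))
```

To expand custom_fields into one row per entry with pandas itself, one option is `pd.json_normalize(data['projects'], record_path='custom_fields', meta=[['author', 'id'], ['author', 'name'], 'currency'], record_prefix='custom_fields.')`, which repeats the listed meta fields on every row.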
I am currently writing a scraper that reads JSON from an API. Calling response.json() returns a dict, so a value can be read with, e.g., response["object"]. The current mock data looks like this:
data = {
'id': 336461,
'thumbnail': '/images/product/123456?trim&h=80',
'variants': None,
'name': 'Testing',
'data': {
'Videoutgång': {
'Typ av gränssnitt': {
'name': 'Typ av gränssnitt',
'value': 'PCI Test'
}
}
},
'stock': {
'web': 0,
'supplier': None,
'displayCap': '50',
'1': 0,
'orders': {
'CL': {
'ordered': -10,
'status': 1
}
}
}
}
The API response sometimes contains "orders -> CL" and sometimes does not. I need to handle both the happy path and the unhappy path, and I am looking for the fastest way to get a value from a dict.
So far I have done something like this:
data = {
'id': 336461,
'thumbnail': '/images/product/123456?trim&h=80',
'variants': None,
'name': 'Testing',
'data': {
'Videoutgång': {
'Typ av gränssnitt': {
'name': 'Typ av gränssnitt',
'value': 'PCI Test'
}
}
},
'stock': {
'web': 0,
'supplier': None,
'displayCap': '50',
'1': 0,
'orders': {
'CL': {
'ordered': -10,
'status': 1
}
}
}
}
if (
    "stock" in data
    and "orders" in data["stock"]
    and "CL" in data["stock"]["orders"]
    and "status" in data["stock"]["orders"]["CL"]
    and data["stock"]["orders"]["CL"]["status"]
):
    print(f'{data["stock"]["orders"]["CL"]["status"]}: {data["stock"]["orders"]["CL"]["ordered"]}')
1: -10
However, my question is: what is the fastest way to get the data from a dict, if it is in the dict?
Dictionary lookups are fast because Python implements dicts as hash tables.
In Big O terms, a dictionary lookup has average-case constant time complexity, O(1). Here is another approach using the .get() method:
data = {
'id': 336461,
'thumbnail': '/images/product/123456?trim&h=80',
'variants': None,
'name': 'Testing',
'data': {
'Videoutgång': {
'Typ av gränssnitt': {
'name': 'Typ av gränssnitt',
'value': 'PCI Test'
}
}
},
'stock': {
'web': 0,
'supplier': None,
'displayCap': '50',
'1': 0,
'orders': {
'CL': {
'ordered': -10,
'status': 1
}
}
}
}
if data.get('stock', {}).get('orders', {}).get('CL'):
    print(f'{data["stock"]["orders"]["CL"]["status"]}: {data["stock"]["orders"]["CL"]["ordered"]}')
Here is a nice writeup on lookups in Python with list and dictionary as example.
I see your point. Both the chained in checks and .get() perform a constant-time hash lookup at each level, so for a dict this small any speed difference will be negligible; .get() mainly makes the code shorter.
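One caveat with chained .get() calls: the {} default only applies when a key is missing, so a key that is present with the value None (like 'supplier' in the mock data) would raise AttributeError on the next .get(). A small helper avoids that. This is a minimal sketch rather than a benchmarked "fastest" option, and `dig` is a hypothetical name.

```python
def dig(mapping, *keys, default=None):
    """Walk nested dicts along keys; return default as soon as a
    level is missing or is not a dict."""
    current = mapping
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


data = {
    "stock": {
        "web": 0,
        "supplier": None,
        "orders": {"CL": {"ordered": -10, "status": 1}},
    }
}

# Happy path: the nested dict is returned and can be used directly.
order = dig(data, "stock", "orders", "CL")
if order and order.get("status"):
    print(f'{order["status"]}: {order["ordered"]}')

# Unhappy paths: a missing key, and a None in the middle of the path.
print(dig(data, "stock", "orders", "XX"))
print(dig(data, "stock", "supplier", "name"))
```

Fetching the inner dict once and reusing it also avoids repeating the full `data["stock"]["orders"]["CL"]` chain for every field.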
I have the following JSON object, in which I need to post-process some labels:
{
'id': '123',
'type': 'A',
'fields':
{
'device_safety':
{
'cost': 0.237,
'total': 22
},
'device_unit_replacement':
{
'cost': 0.262,
'total': 7
},
'software_generalinfo':
{
'cost': 3.6,
'total': 10
}
}
}
I need to split the names of labels by _ to get the following hierarchy:
{
'id': '123',
'type': 'A',
'fields':
{
'device':
{
'safety':
{
'cost': 0.237,
'total': 22
},
'unit':
{
'replacement':
{
'cost': 0.262,
'total': 7
}
}
},
'software':
{
'generalinfo':
{
'cost': 3.6,
'total': 10
}
}
}
}
This is my current version, but I got stuck and am not sure how to deal with the hierarchy of fields:
import json
json_object = json.load(raw_json)
newjson = {}
for x, y in json_object['fields'].items():
    hierarchy = y.split("_")
    if len(hierarchy) > 1:
        for k in hierarchy:
            newjson[k] = ????
newjson = json.dumps(newjson, indent = 4)
Here is a recursive function that will process a dict and split the keys:
def splitkeys(dct):
    if not isinstance(dct, dict):
        return dct
    new_dct = {}
    for k, v in dct.items():
        bits = k.split('_')
        d = new_dct
        for bit in bits[:-1]:
            d = d.setdefault(bit, {})
        d[bits[-1]] = splitkeys(v)
    return new_dct
>>> splitkeys(json_object)
{'fields': {'device': {'safety': {'cost': 0.237, 'total': 22},
'unit': {'replacement': {'cost': 0.262, 'total': 7}}},
'software': {'generalinfo': {'cost': 3.6, 'total': 10}}},
'id': '123',
'type': 'A'}
I am doing a project for my school. I am trying to convert a dict to JSON so I can post it as a JSON request.
Here is the sample dict that I am trying to get as JSON.
dict = {
{'subject': 'testing123',
'start': {
'dateTime': "2020-06-13T21:30:02.096Z",
"timezone": "UTC"},
'end': {
'dateTime': "2020-06-20T21:30:02.096Z",
'timezone': 'UTC'}
}
}
First I tried loading it with json directly, but it failed with the error 'timezone': 'UTC' is not serializable.
So I searched on Google and then tried to dump it to JSON and load it back into a variable, i.e.
dict_json = json.dumps(dict)
dict_json = json.loads(dict)
print(dict_json)
But again I got the hashing error.
Then I searched on Google again and found that I have to convert the dictionaries to tuples if I have several of them, though I am not sure this finding is right.
dict = {
tuple({'subject': 'testing123',
'start': tuple({
'dateTime': "2020-06-13T21:30:02.096Z",
"timezone": "UTC"}),
'end': tuple({
'dateTime': "2020-06-20T21:30:02.096Z",
'timezone': 'UTC'})
})
}
Then it became a set, and a set is not JSON serializable either, so I have run out of options. I need a suggestion on how to convert this dict into valid JSON.
Remove the word tuple and the parentheses that go with it. Next, every entry in a dict must have a key and a value; the first level of your dict does not have such a pair. Here is what I believe to be your intended object:
dict = {
'subject': 'testing123',
'start': {
'dateTime': "2020-06-13T21:30:02.096Z",
'timezone': 'UTC'},
'end': {
'dateTime': "2020-06-20T21:30:02.096Z",
'timezone': 'UTC'}
}
Here is the result of performing your test case on this object:
>>> dict = {
... 'subject': 'testing123',
... 'start': {
... 'dateTime': "2020-06-13T21:30:02.096Z",
... 'timezone': 'UTC'},
... 'end': {
... 'dateTime': "2020-06-20T21:30:02.096Z",
... 'timezone': 'UTC'}
... }
>>> json_dict = json.dumps(dict)
>>> dict_from_json = json.loads(json_dict)
>>> dict_from_json
{'subject': 'testing123', 'start': {'dateTime': '2020-06-13T21:30:02.096Z', 'timezone': 'UTC'}, 'end': {'dateTime': '2020-06-20T21:30:02.096Z', 'timezone': 'UTC'}}
>>> dict_from_json['subject']
'testing123'
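To see why the original literal failed: braces that contain a value with no key in front of it are parsed as a set, and a dict cannot be a set member because it is unhashable. Wrapping the dict in tuple(...) only keeps its keys, which yields a set of tuples that json cannot serialize either. A short sketch reproducing both errors, using a shortened version of the dict and illustrative variable names:

```python
import json

event = {"subject": "testing123", "start": {"timezone": "UTC"}}

# 1) A dict placed inside braces without a key is treated as a set element.
try:
    broken = {event}
except TypeError as exc:
    print("set of dict:", exc)

# 2) tuple(dict) keeps only the keys, so the values are silently lost...
as_tuple = tuple(event)
print(as_tuple)

# ...and the resulting set is still not JSON serializable.
try:
    json.dumps({as_tuple})
except TypeError as exc:
    print("set of tuple:", exc)

# 3) A plain dict with key/value pairs at every level works.
payload = json.dumps(event)
print(payload)
```

Naming the variable something other than dict (here `event`) also avoids shadowing the built-in type.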
This is an example of my json payload:
{'data':
[{
'predictionValues':
[
{'value': 0.9926338328, 'label': 1.0},
{'value': 0.0073661672, 'label': 0.0}
],
'predictionThreshold': 0.5,
'prediction': 1.0,
'rowId': 0,
'passthroughValues':
{'Id': 'AMF012-000272'}
},
{
'predictionValues':
[
{'value': 0.446989075, 'label': 1.0},
{'value': 0.553010925, 'label': 0.0}
],
'predictionThreshold': 0.5,
'prediction': 0.0,
'rowId': 1,
'passthroughValues':
{'Id': 'NSF008-000165'}
}]
}
I am trying to get a df that looks like this and can't seem to figure it out:
passthroughValues.Id predictionValues.Value_1.0 predictionValues.Value_0.0
AMF012-000272 0.9926338328 0.0073661672
NSF008-000165 0.446989075 0.553010925
Just running it without any parameters doesn't work:
df = json_normalize(finalPredictions)
This returns predictionValues as a series.
df = json_normalize(finalPredictions, ['data', 'PredictionValues'])
That only returns the 0 and 1 rows, without the Id needed to associate them back to my data.
Found the answer:
result = json_normalize(finalPredictions['data'], 'predictionValues', [['passthroughValues', 'Id']])
I really only care about the positive result
result = result[result['label']==1]
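For the wide layout shown in the question (one row per Id, one column per label), the records can also be reshaped before they reach pandas. This is a minimal sketch in plain Python, assuming every row carries a passthroughValues Id and a predictionValues list as in the payload above; `final_predictions` is a copy of that payload reduced to the relevant fields.

```python
final_predictions = {
    "data": [
        {
            "predictionValues": [
                {"value": 0.9926338328, "label": 1.0},
                {"value": 0.0073661672, "label": 0.0},
            ],
            "passthroughValues": {"Id": "AMF012-000272"},
        },
        {
            "predictionValues": [
                {"value": 0.446989075, "label": 1.0},
                {"value": 0.553010925, "label": 0.0},
            ],
            "passthroughValues": {"Id": "NSF008-000165"},
        },
    ]
}

rows = []
for record in final_predictions["data"]:
    row = {"passthroughValues.Id": record["passthroughValues"]["Id"]}
    for pred in record["predictionValues"]:
        # One column per label, e.g. "predictionValues.Value_1.0".
        row[f"predictionValues.Value_{pred['label']}"] = pred["value"]
    rows.append(row)

for row in rows:
    print(row)
```

The resulting list of dicts can be passed straight to pd.DataFrame(rows) to get the table from the question.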