Read multiple jsons from one file [duplicate] - python

This question already has answers here:
Loading and parsing a JSON file with multiple JSON objects
(5 answers)
Closed 3 years ago.
I am working with Python and I have a file (data.json) which contains multiple JSON objects, but the file as a whole is not valid JSON.
The file looks like this:
{ "_id" : 01, ..., "path" : "2017-12-12" }
{ "_id" : 02, ..., "path" : "2017-1-12" }
{ "_id" : 03, ..., "path" : "2017-5-12" }
In place of the ... there are about 30 more keys, some of which contain nested JSON objects (so each object above is fairly long).
Each of the blocks above is a valid JSON object on its own, but the file as a whole is not valid JSON, since the objects are not separated by commas or wrapped in an array.
How can I read each of these JSON objects separately, either with pandas or with plain Python?
I have tried this:
import pandas as pd
df = pd.read_json('~/Desktop/data.json', lines=True)
and it does create a DataFrame where each row corresponds to one JSON object, but it also creates a column for each top-level key of the objects, which makes things messier instead of putting each whole JSON object in a single cell.
To be clearer, I would like my output to look like this in a pandas DataFrame (or in another sensible data structure):
jsons
0 { "_id" : 01, ..., "path" : "2017-12-12" }
1 { "_id" : 02, ..., "path" : "2017-1-12" }
2 { "_id" : 03, ..., "path" : "2017-5-12" }
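A minimal sketch of one way to get exactly that shape: read each line and parse it with json.loads into a single "jsons" column. The sample lines here are hypothetical stand-ins with valid JSON values (note that ids with leading zeros like 01 would not parse as JSON):

```python
import json
import pandas as pd

# hypothetical stand-in for the contents of data.json;
# JSON forbids leading zeros, so the ids here are plain integers
lines = [
    '{ "_id" : 1, "path" : "2017-12-12" }',
    '{ "_id" : 2, "path" : "2017-1-12" }',
    '{ "_id" : 3, "path" : "2017-5-12" }',
]
# one dict per row, whole object in one cell
df = pd.DataFrame({'jsons': [json.loads(line) for line in lines]})
print(df)
```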

The idea is to use read_csv with a separator that does not exist in the data, and then convert each value of the resulting column to a dictionary:
import pandas as pd
import ast, json
from io import StringIO
temp = u"""{ "_id" : 1, "path" : "2017-12-12" }
{ "_id" : 2, "path" : "2017-1-12" }
{ "_id" : 3, "path" : "2017-5-12" }"""
# after testing, replace StringIO(temp) with the real file path
df = pd.read_csv(StringIO(temp), sep="|", names=['data'])
print(df)
# parse each line into a dict with json
df['data'] = df['data'].apply(json.loads)
# alternative: parse Python-literal-style lines with ast
#df['data'] = df['data'].apply(ast.literal_eval)
print(df)
data
0 {'_id': 1, 'path': '2017-12-12'}
1 {'_id': 2, 'path': '2017-1-12'}
2 {'_id': 3, 'path': '2017-5-12'}

As the file itself is not valid JSON, I will read it line by line; since each line is a string, I will convert it to a dict using yaml, and finally collect everything into a DataFrame:
import yaml
import pandas as pd
rows = []
with open('data.json') as f:
    for line in f:
        # parse each string line into a dict
        # (yaml tolerates the loose formatting of these lines)
        d = yaml.safe_load(line)
        rows.append(d)
# build the DataFrame once at the end
# (DataFrame.append was removed in pandas 2.0)
df = pd.DataFrame(rows)
print(df)
#output
_id path
0 01 2017-12-12
1 02 2017-1-12
2 03 2017-5-12
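If the lines happen to be strictly valid JSON, json.loads works line by line without the yaml dependency. A sketch with hypothetical inline sample lines in place of the file:

```python
import json
import pandas as pd

# hypothetical sample lines; in practice iterate over the open file
lines = [
    '{"_id": 1, "path": "2017-12-12"}',
    '{"_id": 2, "path": "2017-1-12"}',
]
# one parsed dict per line, then one DataFrame build at the end
df = pd.DataFrame([json.loads(line) for line in lines])
print(df)
```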

Related

Convert Embedded JSON Dict To Panda DataFrame Where Columns Headers Are Seperate From Values

I'm trying to create a python pandas DataFrame out of a JSON dictionary. The embedding is tripping me up.
The column headers are in a different section of the JSON file to the values.
The json looks similar to below. There is one section of column headers and multiple sections of data.
I need each column filled with the data that relates to it. So value_one in each case will fill the column under header_one and so on.
I have come close, but can't seem to get it to spit out the dataframe as described.
{
"my_data": {
"column_headers": [
"header_one",
"header_two",
"header_three"
],
"values": [
{
"data": [
"value_one",
"value_two",
"value_three"
]
},
{
"data": [
"value_one",
"value_two",
"value_three"
]
}
]
}
}
Assuming your dictionary is my_dict, try:
>>> pd.DataFrame(data=[d["data"] for d in my_dict["my_data"]["values"]],
...              columns=my_dict["my_data"]["column_headers"])
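A self-contained check of that one-liner, with the question's dictionary reconstructed inline:

```python
import pandas as pd

# the question's structure, reconstructed as a Python dict
my_dict = {
    "my_data": {
        "column_headers": ["header_one", "header_two", "header_three"],
        "values": [
            {"data": ["value_one", "value_two", "value_three"]},
            {"data": ["value_one", "value_two", "value_three"]},
        ],
    }
}
# each "data" list becomes a row; headers come from "column_headers"
df = pd.DataFrame(data=[d["data"] for d in my_dict["my_data"]["values"]],
                  columns=my_dict["my_data"]["column_headers"])
print(df)
```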

Deeply nested json - a list within a dictionary to Pandas DataFrame

I'm trying to parse nested json results.
data = {
"results": [
{
"components": [
{
"times": {
"periods": [
{
"fromDayOfWeek": 0,
"fromHour": 12,
"fromMinute": 0,
"toDayOfWeek": 4,
"toHour": 21,
"toMinute": 0,
"id": 156589,
"periodId": 20855
}
],
}
}
],
}
],
}
I can get to and create dataframes for "results" and "components" lists, but cannot get to "periods" due to the "times" dict. So far I have this:
df = pd.json_normalize(data, record_path = ['results','components'])
Need a separate "periods" dataframe with the included column names and values. Would appreciate your help on this. Thank you!
The nesting has four levels:
1. results
2. components
3. times
4. periods
json_normalize should be the right tool:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html
There are 4 levels of nesting, and there can be several components per result and several periods per times block; arguably that depth of nesting is over-engineered.
The simplest way of getting at the data is chained indexing; since results and components are lists, you need indices:
print(data['results'][0]['components'][0]['times']['periods'])
You can collect a specific property across all periods with code like this:
def get_property_from_periods(data, prop):
    property_list = []
    for result in data['results']:
        for component in result['components']:
            for period in component['times']['periods']:
                property_list.append(period[prop])
    return property_list
This gives you access to one property inside periods (fromDayOfWeek, fromHour, fromMinute, ...).
After converting the JSON value, transform it into a pandas DataFrame:
print(pd.DataFrame(data, columns=["columnA", "columnB"]))
If stuck:
How to Create a table with data from JSON output in Python
Python - How to convert JSON File to Dataframe
pandas documentation:
pandas.DataFrame.from_dict
pandas.json_normalize
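To get the separate "periods" DataFrame the question asks for, one sketch is to collect the period dicts with a comprehension (walking the result and component lists explicitly) and hand them to the DataFrame constructor, using the question's data trimmed to the relevant nesting:

```python
import pandas as pd

# the question's data, trimmed to the relevant nesting
data = {
    "results": [
        {"components": [
            {"times": {"periods": [
                {"fromDayOfWeek": 0, "fromHour": 12, "fromMinute": 0,
                 "toDayOfWeek": 4, "toHour": 21, "toMinute": 0,
                 "id": 156589, "periodId": 20855}
            ]}}
        ]}
    ]
}
# flatten the list levels; "times" is a plain dict, so index into it directly
periods = [p
           for result in data["results"]
           for component in result["components"]
           for p in component["times"]["periods"]]
df_periods = pd.DataFrame(periods)
print(df_periods)
```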

Import hierarchical JSON file into Pandas dataframe

I have looked for solutions to my problem but couldn't find anything that applies. I'm trying to import a high dimension JSON file into a Pandas dataframe.
The structure is something like:
{ 'manufacturing_plant_events':
{ 'data':
{ 'shiftInformation':
{ 'shift1':
{ 'color': 'red'
, 'amount' : 32
, 'order' : None
},
'shift2':
{ 'color': 'blue'
, 'amount' : 44
, 'order' : 1
},
'shift3':
{ 'color': 'green'
, 'amount' : 98
, 'order' : 2
}
}
...}
...}
}
I have tried numerous solutions including:
json.loads()
pd.DataFrame(json)
json_normalize(json)
pd.read_json(json)
and others. I've also tried flattening my array and converting it into a dataframe, but that didn't work either. I'm not sure if this is even possible, or if the dataframe supports only a few levels of nesting.
The flattening I tried was to create dataframe columns that contain the leaf information. Hence, I'm also fine with a dataframe of two columns: the full path, and the value stored at that leaf.
First row in my dataframe:
(
manufacturing_plant_events.data.shiftInformation.shift1.color
'red'
manufacturing_plant_events.data.shiftInformation.shift1.amount
32
manufacturing_plant_events.data.shiftInformation.shift1.order
None
)
and so on.
Any suggestion on how to solve this is highly appreciated.
I have come up with a dataframe by flattening the dict:
import pandas as pd

def flat_dict(dictionary, prefix):
    # recursively collect [key1, key2, ..., leaf_value] rows
    if isinstance(dictionary, dict):
        rows = []
        for key, items in dictionary.items():
            rows += flat_dict(items, prefix + [key])
        return rows
    else:
        return [prefix + [dictionary]]

def dict_to_df(dictionary):
    return pd.DataFrame(flat_dict(dictionary, []))
Of course, you first need to load your JSON into a dict with the json package.
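An alternative sketch for the path/value layout the question describes: pd.json_normalize flattens purely nested dicts into one row with dotted column names, and melt turns that into path/value rows. This uses a trimmed version of the question's dict (shift3 omitted for brevity):

```python
import pandas as pd

# trimmed version of the question's nested dict
d = {
    "manufacturing_plant_events": {
        "data": {
            "shiftInformation": {
                "shift1": {"color": "red", "amount": 32, "order": None},
                "shift2": {"color": "blue", "amount": 44, "order": 1},
            }
        }
    }
}
# one row, with columns like "manufacturing_plant_events.data...color"
flat = pd.json_normalize(d, sep=".")
# reshape to two columns: full path and leaf value
long_df = flat.melt(var_name="path", value_name="value")
print(long_df)
```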

Create a data frame from a complex nested dictionary?

I have a deeply nested JSON file saved in .txt format. I need to access some specific key pairs and create a data frame, or another transformed JSON object, for further use. Here is a small sample with 2 records.
[
{
"ko_id": [819752],
"concepts": [
{
"id": ["11A71731B880:http://ontology.intranet.com/Taxonomy/116#en"],
"uri": ["http://ontology.intranet.com/Taxonomy/116"],
"language": ["en"],
"prefLabel": ["Client coverage & relationship management"]
}
]
},
{
"ko_id": [819753],
"concepts": [
{
"id": ["11A71731B880:http://ontology.intranet.com/Taxonomy/116#en"],
"uri": ["http://ontology.intranet.com/Taxonomy/116"],
"language": ["en"],
"prefLabel": ["Client coverage & relationship management"]
}
]
}
]
The following code loads the data as a list, but I need to access the data as dictionaries: I need the "ko_id", "uri" and "prefLabel" from each record, and I want to put them into a pandas data frame or a dictionary for further analysis.
with open('sample_data.txt') as data_file:
    json_sample = js.load(data_file)
The following code gives me the exact value of the first element, but I do not know how to put it all together and build the algorithm to create the dataframe:
print(sample_dict["ko_id"][0])
print(sample_dict["concepts"][0]["prefLabel"][0])
print(sample_dict["concepts"][0]["uri"][0])
for record in sample_dict:
    df = pd.DataFrame(record['concepts'])
    df['ko_id'] = record['ko_id']
    final_df = final_df.append(df)
You can pass the data to pandas.DataFrame using a generator:
import pandas as pd
import json as js
with open('sample_data.txt') as data_file:
    json_sample = js.load(data_file)

df = pd.DataFrame(data=((key["ko_id"][0],
                         key["concepts"][0]["prefLabel"][0],
                         key["concepts"][0]["uri"][0]) for key in json_sample),
                  columns=("ko_id", "prefLabel", "uri"))
Output:
>>> df
ko_id prefLabel uri
0 819752 Client coverage & relationship management http://ontology.intranet.com/Taxonomy/116
1 819753 Client coverage & relationship management http://ontology.intranet.com/Taxonomy/116
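If a record can hold more than one entry in "concepts", a nested comprehension generalizes the same idea. A sketch with the question's records inlined (trimmed to the fields being extracted):

```python
import pandas as pd

# the question's records, trimmed to the fields being extracted
json_sample = [
    {"ko_id": [819752],
     "concepts": [{"uri": ["http://ontology.intranet.com/Taxonomy/116"],
                   "prefLabel": ["Client coverage & relationship management"]}]},
    {"ko_id": [819753],
     "concepts": [{"uri": ["http://ontology.intranet.com/Taxonomy/116"],
                   "prefLabel": ["Client coverage & relationship management"]}]},
]
# one output row per (record, concept) pair
rows = [{"ko_id": rec["ko_id"][0],
         "prefLabel": c["prefLabel"][0],
         "uri": c["uri"][0]}
        for rec in json_sample
        for c in rec["concepts"]]
df = pd.DataFrame(rows)
print(df)
```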

JSON to Pandas: is there a more elegant solution?

I have some JSON, returned from an API call, that looks something like this:
{
"result": {
"code": "OK",
"msg": ""
},
"report_name": "FAMOUS_DICTATORS",
"columns": [
"rank",
"name",
"deaths"
],
"data": [
{
"row": [
1,
"Mao Zedong",
63000000
]
},
{
"row": [
2,
"Jozef Stalin",
23000000
]
}
]
}
I'd like to convert the JSON into a Pandas DataFrame:
rank name deaths
1 Mao Zedong 63000000
2 Jozef Stalin 23000000
I wrote this and it works, but looks a bit ugly:
import pandas as pd
import json
columns = eval(r.content)['columns']
df = pd.DataFrame(columns = eval(r.content)['columns'])
for row in eval(r.content)['data']:
    df.loc[len(df)+1] = row['row']
Is there a more elegant/Pythonic way to do this (e.g. possibly using pandas.io.json.read_json)?
pandas' read_json is a tricky method to use. If you don't know with certainty that your JSON is valid, or whether its structure maps cleanly onto a dataframe, it's much better to stick to tried and tested methods to break your data down into something pandas can use without issues.
In your case, I suggest breaking the data down to a list of lists. Out of all that JSON, the only parts you really need are the "data" and "columns" keys.
Try this:
import pandas as pd
import json
from urllib.request import urlopen  # Python 3; urllib.urlopen was Python 2

js = json.loads(urlopen("test.json").read())
data = js["data"]
rows = [row["row"] for row in data]  # transform the 'row' keys to a list of lists
df = pd.DataFrame(rows, columns=js["columns"])
print(df)
This gives me the desired result:
rank name deaths
0 1 Mao Zedong 63000000
1 2 Jozef Stalin 23000000
See pandas.io.json.read_json(path_or_buf=None, orient=None, typ='frame', dtype=True, convert_axes=True, convert_dates=True, keep_default_dates=True, numpy=False, precise_float=False, date_unit=None):
http://pandas.pydata.org/pandas-docs/dev/generated/pandas.io.json.read_json.html
