How to get metadata when reading nested JSON with pandas - Python

I'm trying to get the metadata out of a JSON file using pandas json_normalize, but it does not work as expected.
I have a JSON file with the following structure:
data = [
    {'a': 'aa',
     'b': {'b1': 'bb1', 'b2': 'bb2'},
     'c': [{'ca': [{'ca1': 'caa1'}]}]}
]
I'd like to get the following:
    ca1   a b.b1
0  caa1  aa  bb1
I would expect this to work:
pd.json_normalize(data, record_path=['c', 'ca'], meta=['a', ['b', 'b1']])
but it doesn't find the key b1. Strangely enough, if my record_path is 'c' alone, it does find the key.
I feel I'm missing something here, but I can't figure out what.
I appreciate any help!

Going down the first level, grab the meta as the list of columns you want to keep. For record_path, use a list to map the levels you want to descend. Finally, column b holds dicts: apply pd.Series to them, concat the result back into the df, and drop the unpacked dict column.
df = pd.json_normalize(
    data=data,
    meta=['a', 'b'],
    record_path=['c', 'ca']
)
df = pd.concat([df.drop(['b'], axis=1), df['b'].apply(pd.Series)], axis=1)
print(df)
Output:
    ca1   a   b1   b2
0  caa1  aa  bb1  bb2

This is the workaround I used eventually:
df = pd.json_normalize(data, record_path=['c', 'ca'], meta=['a', 'b'])
df = pd.concat([df, pd.json_normalize(df['b'])], axis=1)
df.drop(columns='b', inplace=True)
I still think there should be a better way, but it works.
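If the dotted names from the question (b.b1) are wanted, a small variation on the same idea works; a minimal sketch, pulling 'b' whole as meta and flattening it with a prefix:

```python
import pandas as pd

data = [{'a': 'aa',
         'b': {'b1': 'bb1', 'b2': 'bb2'},
         'c': [{'ca': [{'ca1': 'caa1'}]}]}]

# Pull 'b' whole as meta, then flatten the dict column separately,
# keeping the dotted names ('b.b1', 'b.b2') the question asked for.
df = pd.json_normalize(data, record_path=['c', 'ca'], meta=['a', 'b'])
flat = pd.json_normalize(df['b'].tolist()).add_prefix('b.')
df = df.drop(columns='b').join(flat)
print(df)
#     ca1   a b.b1 b.b2
# 0  caa1  aa  bb1  bb2
```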

Related

Unable to pull the key:value from dictionary in column to multiple columns

So I was using the solution in this post (Split / Explode a column of dictionaries into separate columns with pandas) but nothing changes in my df.
Here is df before code:
number status_timestamps
0 234234 {"created": "2020-11-30T19:44:42Z", "complete"...
1 2342 {"created": "2020-12-14T13:43:48Z", "complete"...
Here is a sample of the dictionary in that column:
{"created": "2020-11-30T19:44:42Z",
"complete": "2021-01-17T14:20:58Z",
"invoiced": "2020-12-16T22:55:02Z",
"confirmed": "2020-11-30T21:16:48Z",
"in_production": "2020-12-11T18:59:26Z",
"invoice_needed": "2020-12-11T22:00:09Z",
"accepted": "2020-12-01T00:00:23Z",
"assets_uploaded": "2020-12-11T17:16:53Z",
"notified": "2020-11-30T21:17:48Z",
"processing": "2020-12-11T18:49:50Z",
"classified": "2020-12-11T18:49:50Z"}
Here is what I tried and df does not change:
df_final = pd.concat([df, df['status_timestamps'].progress_apply(pd.Series)], axis = 1).drop('status_timestamps', axis = 1)
Please provide a minimal reproducible working example of what you have tried next time.
If I follow the solution in the mentioned post, it works.
This is the code I have used:
import pandas as pd
json_data = {"created": "2020-11-30T19:44:42Z",
             "complete": "2021-01-17T14:20:58Z",
             "invoiced": "2020-12-16T22:55:02Z",
             "confirmed": "2020-11-30T21:16:48Z",
             "in_production": "2020-12-11T18:59:26Z",
             "invoice_needed": "2020-12-11T22:00:09Z",
             "accepted": "2020-12-01T00:00:23Z",
             "assets_uploaded": "2020-12-11T17:16:53Z",
             "notified": "2020-11-30T21:17:48Z",
             "processing": "2020-12-11T18:49:50Z",
             "classified": "2020-12-11T18:49:50Z"}
df = pd.DataFrame({"number": 2342, "status_timestamps": [json_data]})
# fastest solution proposed by your reference post
df.join(pd.DataFrame(df.pop('status_timestamps').values.tolist()))
I was able to use another answer from that post, but switched to the safer literal_eval, since the answer used eval.
Here is working code:
import pandas as pd
from ast import literal_eval
df = pd.read_csv('c:/status_timestamps.csv')
df["status_timestamps"] = df["status_timestamps"].apply(literal_eval)
df2 = df["status_timestamps"].apply(pd.Series)
df_final = pd.concat([df, df2], axis=1).drop('status_timestamps', axis=1)
df_final
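Since the timestamps shown use double-quoted keys and string values, each cell is already valid JSON, so json.loads is an alternative to literal_eval; a minimal sketch, assuming the column holds such strings (inline stand-in data instead of the CSV):

```python
import json
import pandas as pd

df = pd.DataFrame({
    "number": [234234],
    "status_timestamps": ['{"created": "2020-11-30T19:44:42Z", '
                          '"complete": "2021-01-17T14:20:58Z"}'],
})
# Parse each string, then expand the dicts into columns,
# following the same join/pop pattern as the referenced answer.
df_final = df.join(pd.DataFrame(df.pop("status_timestamps").map(json.loads).tolist()))
print(df_final)
#    number               created              complete
# 0  234234  2020-11-30T19:44:42Z  2021-01-17T14:20:58Z
```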

Normalize json column and join with rest of dataframe

This is my first question here on stackoverflow so please don't roast me.
I tried to find similar problems on the internet, and there are actually several, but their solutions didn't work for me.
I have created this dataframe:
import pandas as pd
from ast import literal_eval
d = {'order_id': [1], 'email': ["hi#test.com"], 'line_items': ["[{'sku':'testproduct1', 'quantity':'2'},{'sku':'testproduct2','quantity':'2'}]"]}
orders = pd.DataFrame(data=d)
It looks like this:
order_id email line_items
1 hi#test.com [{'sku':'testproduct1', 'quantity':'2'},{'sku':'testproduct2','quantity':'2'}]
I want the dataframe to look like this:
order_id email line_items.sku line_items.quantity
1 hi#test.com testproduct1 2
1 hi#test.com testproduct2 2
I used the following code to change the type of line_items from string to dict:
orders.line_items = orders.line_items.apply(literal_eval)
Normally I would use json_normalize now to flatten the line_items column. But I also want to keep the id and don't know how to do that. I also want to avoid any loops.
Is there anyone who can help me with this issue?
Kind regards
joant95
If your dictionary really is that strange, then you could try:
d['line_items'] = eval(d['line_items'][0])
df = pd.json_normalize(d, record_path=['line_items'], meta=['order_id', 'email'])
To create d out of orders you could try:
d = orders.to_dict(orient='list')
Or you could try:
orders.line_items = orders.line_items.map(eval)
d = orders.to_dict(orient='records')
df = pd.json_normalize(d, record_path=['line_items'], meta=['order_id', 'email'])
But: I still don't have a clear picture of the situation :)
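Putting the second suggestion together as one runnable sketch (using literal_eval, which the question already imports, rather than eval):

```python
import pandas as pd
from ast import literal_eval

d = {'order_id': [1], 'email': ["hi#test.com"],
     'line_items': ["[{'sku':'testproduct1', 'quantity':'2'},"
                    "{'sku':'testproduct2','quantity':'2'}]"]}
orders = pd.DataFrame(data=d)

# Turn the string column into real lists of dicts, then normalize
# with the id and email carried along as meta.
orders.line_items = orders.line_items.map(literal_eval)
records = orders.to_dict(orient='records')
df = pd.json_normalize(records, record_path=['line_items'],
                       meta=['order_id', 'email'])
print(df)
# Two rows, one per line item, with order_id and email repeated.
```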

How to read this JSON file in Python?

I'm trying to read such a JSON file in Python, to save only two of the values of each response part:
{
  "responseHeader": {
    "status": 0,
    "time": 2,
    "params": {
      "q": "query",
      "rows": "2",
      "wt": "json"
    }
  },
  "response": {
    "results": 2,
    "start": 0,
    "docs": [
      {"name": ["Peter"], "country": ["England"], "age": ["23"]},
      {"name": ["Harry"], "country": ["Wales"], "age": ["30"]}
    ]
  }
}
For example, I want to put the name and the age in a table. I already tried it this way (based on this topic), but it's not working for me.
import json
import pandas as pd
file = open("myfile.json")
data = json.loads(file)
columns = [dct['name', 'age'] for dct in data['response']]
df = pd.DataFrame(data['response'], columns=columns)
print(df)
I also have seen more solutions of reading a JSON file, but that all were solutions of a JSON file with no other header values at the top, like responseHeader in this case. I don't know how to handle that. Anyone who can help me out?
import json

with open("myfile.json") as f:
    columns = [(dic["name"], dic["age"]) for dic in json.load(f)["response"]["docs"]]
print(columns)
result:
[(['Peter'], ['23']), (['Harry'], ['30'])]
You can pass the list data["response"]["docs"] to pandas directly as it's a recordset.
df = pd.DataFrame(data["response"]["docs"])
print(df)
>>> name country age
0 [Peter] [England] [23]
1 [Harry] [Wales] [30]
The data in your DataFrame will be bracketed, though, as you can see. If you want to remove the brackets, you can consider the following:
for column in df.columns:
    df.loc[:, column] = df.loc[:, column].str.get(0)
    if column == 'age':
        df.loc[:, column] = df.loc[:, column].astype(int)
sample = {
    "responseHeader": {"status": 0, "time": 2,
                       "params": {"q": "query", "rows": "2", "wt": "json"}},
    "response": {"results": 2, "start": 0, "docs": [
        {"name": ["Peter"], "country": ["England"], "age": ["23"]},
        {"name": ["Harry"], "country": ["Wales"], "age": ["30"]}]}}
data = [(x['name'][0], x['age'][0]) for x in sample['response']['docs']]
df = pd.DataFrame(data, columns=['name', 'age'])
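Combining the pieces into one runnable sketch: load the file with json.load (not json.loads, which expects a string), index into response.docs, and unwrap the one-element lists:

```python
import pandas as pd

# Inline stand-in for the file; for a real file this would be
# data = json.load(open("myfile.json")).
sample = {"responseHeader": {"status": 0, "time": 2,
                             "params": {"q": "query", "rows": "2", "wt": "json"}},
          "response": {"results": 2, "start": 0, "docs": [
              {"name": ["Peter"], "country": ["England"], "age": ["23"]},
              {"name": ["Harry"], "country": ["Wales"], "age": ["30"]}]}}

df = pd.DataFrame(sample["response"]["docs"])[["name", "age"]]
for column in df.columns:
    df[column] = df[column].str.get(0)   # unwrap the one-element lists
df["age"] = df["age"].astype(int)
print(df)
#     name  age
# 0  Peter   23
# 1  Harry   30
```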

ValueError: arrays must all be same length - print dataframe to CSV

Thanks for stopping by! I was hoping to get some help creating a CSV using a pandas dataframe. Here is my code:
a = ldamallet[bow_corpus_new[:21]]
b = data_text_new
print(a)
print("\n")
print(b)
d = {'Preprocessed Document': b['Preprocessed Document'].tolist(),
     'topic_0': a[0][1],
     'topic_1': a[1][1],
     'topic_2': a[2][1],
     'topic_3': a[3][1],
     'topic_4': a[4][1],
     'topic_5': a[5][1],
     'topic_6': a[6][1],
     'topic_7': a[7][1],
     'topic_8': a[8][1],
     'topic_9': a[9][1],
     'topic_10': a[10][1],
     'topic_11': a[11][1],
     'topic_12': a[12][1],
     'topic_13': a[13][1],
     'topic_14': a[14][1],
     'topic_15': a[15][1],
     'topic_16': a[16][1],
     'topic_17': a[17][1],
     'topic_18': a[18][1],
     'topic_19': a[19][1]}
print(d)
df = pd.DataFrame(data=d)
df.to_csv("test.csv", index=False)
The data:
print(a): the format is in tuples, one list per row:
[[(topic number 0, topic percentage), ..., (19, #)], [(topic distribution for next row, #), ..., (19, .819438), ..., (#, #), ...]]
The output of print(b), the error traceback, the size of the dataframe, and the desired result were shown as screenshots (not included here).
Any help would be greatly appreciated :)
It might be easiest to collect the second value of each tuple, for all of the rows, in its own list. Something like this:
topic_0 = []
topic_1 = []
topic_2 = []
# ...and so on
for i in a:
    topic_0.append(i[0][1])
    topic_1.append(i[1][1])
    topic_2.append(i[2][1])
    # ...and so on
Then you can make your dictionary like so:
d = {'Preprocessed Document': b['Preprocessed Document'].tolist(),
     'topic_0': topic_0,
     'topic_1': topic_1,
     # etc.
     }
I took @mattcremeens advice and it worked. I've posted the full code below. He was right about nixing the tuples; my previous code wasn't iterating through the rows and only printed the first row.
topic_0=[]
topic_1=[]
topic_2=[]
topic_3=[]
topic_4=[]
topic_5=[]
topic_6=[]
topic_7=[]
topic_8=[]
topic_9=[]
topic_10=[]
topic_11=[]
topic_12=[]
topic_13=[]
topic_14=[]
topic_15=[]
topic_16=[]
topic_17=[]
topic_18=[]
topic_19=[]
for i in a:
    topic_0.append(i[0][1])
    topic_1.append(i[1][1])
    topic_2.append(i[2][1])
    topic_3.append(i[3][1])
    topic_4.append(i[4][1])
    topic_5.append(i[5][1])
    topic_6.append(i[6][1])
    topic_7.append(i[7][1])
    topic_8.append(i[8][1])
    topic_9.append(i[9][1])
    topic_10.append(i[10][1])
    topic_11.append(i[11][1])
    topic_12.append(i[12][1])
    topic_13.append(i[13][1])
    topic_14.append(i[14][1])
    topic_15.append(i[15][1])
    topic_16.append(i[16][1])
    topic_17.append(i[17][1])
    topic_18.append(i[18][1])
    topic_19.append(i[19][1])
d = {'Preprocessed Document': b['Preprocessed Document'].tolist(),
     'topic_0': topic_0,
     'topic_1': topic_1,
     'topic_2': topic_2,
     'topic_3': topic_3,
     'topic_4': topic_4,
     'topic_5': topic_5,
     'topic_6': topic_6,
     'topic_7': topic_7,
     'topic_8': topic_8,
     'topic_9': topic_9,
     'topic_10': topic_10,
     'topic_11': topic_11,
     'topic_12': topic_12,
     'topic_13': topic_13,
     'topic_14': topic_14,
     'topic_15': topic_15,
     'topic_16': topic_16,
     'topic_17': topic_17,
     'topic_18': topic_18,
     'topic_19': topic_19}
df = pd.DataFrame(data=d)
df.to_csv("test.csv", index=False, mode='a')
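The twenty hand-written lists can also be collapsed into a dict comprehension; a sketch, assuming a is a list of rows where each row is a list of (topic_number, percentage) tuples (the small a and b_docs below are hypothetical stand-ins for the ldamallet output and the document column):

```python
import pandas as pd

# Hypothetical stand-ins: 2 documents x 3 topics.
a = [[(0, 0.5), (1, 0.3), (2, 0.2)],
     [(0, 0.1), (1, 0.7), (2, 0.2)]]
b_docs = ['doc one', 'doc two']

# One column per topic, built from the second tuple element of each row.
d = {'Preprocessed Document': b_docs}
d.update({f'topic_{t}': [row[t][1] for row in a] for t in range(len(a[0]))})
df = pd.DataFrame(d)
print(df)
```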

Parsing JSON in Pandas

I need to extract the following json:
{"PhysicalDisks":[{"Status":"SMART Passed","Name":"/dev/sda"}]}
{"PhysicalDisks":[{"Status":"SMART Passed","Name":"/dev/sda"},{"Status":"SMART Passed","Name":"/dev/sdb"}]}
{"PhysicalDisks":[{"Status":"SMART Passed","Name":"/dev/sda"},{"Status":"SMART Passed","Name":"/dev/sdb"}]}
{"PhysicalDisks":[{"Name":"disk0","Status":"Passed"},{"Name":"disk1","Status":"Passed"}]}
{"PhysicalDisks":[{"Name":"disk0","Status":"Failed"},{"Name":"disk1","Status":"not supported"}]}
{"PhysicalDisks":[{"Name":"disk0","Status":"Passed"}]}
Name: raw_results, dtype: object
Into separate columns. I don't know how many disks per result there might be in the future. What would be the best way here?
I tried the following:
d = raw_res['raw_results'].map(json.loads).apply(pd.Series).add_prefix('raw_results.')
The result, and an example of the output I'd want, were shown as screenshots (omitted here).
A better way would be to add each disk check as an additional row in the dataframe, with the same checkid as the row it was extracted from; a result with 3 disks would then generate 3 rows, one per disk.
UPDATE
This code
# This works
import numpy as np

dfs = []
def json_to_df(row, json_col):
    json_df = pd.read_json(row[json_col])
    dfs.append(json_df.assign(**row.drop(json_col)))

df['raw_results'].replace("{}", np.nan, inplace=True)  # pd.np is deprecated
df = df.dropna()
df.apply(json_to_df, axis=1, json_col='raw_results')
df = pd.concat(dfs)
df.head()
Adds an extra row for each disk (sda, sdb etc.)
So now I would need to split this column into 2: Status and Name.
df1 = df["PhysicalDisks"].apply(pd.Series)
df_final = pd.concat([df, df1], axis=1).drop('PhysicalDisks', axis=1)
df_final.head()
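Alternatively, the whole pipeline can be sketched without the intermediate PhysicalDisks column: parse each JSON string, normalize its record list, and tag every resulting row with the index of the check it came from (checkid is an assumed name, and the two inline strings stand in for the raw_results column):

```python
import json
import pandas as pd

raw_res = pd.DataFrame({'raw_results': [
    '{"PhysicalDisks":[{"Status":"SMART Passed","Name":"/dev/sda"}]}',
    '{"PhysicalDisks":[{"Name":"disk0","Status":"Failed"},'
    '{"Name":"disk1","Status":"not supported"}]}',
]})

# One normalized frame per check, each tagged with its source row index,
# so a 3-disk result yields 3 rows sharing one checkid.
frames = [
    pd.json_normalize(json.loads(raw), record_path='PhysicalDisks').assign(checkid=i)
    for i, raw in raw_res['raw_results'].items()
]
df = pd.concat(frames, ignore_index=True)
print(df)
```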
