I'm trying to read a JSON file like the one below in Python and keep only two of the values from each entry in the response:
{
"responseHeader":{
"status":0,
"time":2,
"params":{
"q":"query",
"rows":"2",
"wt":"json"}},
"response":{"results":2,"start":0,"docs":[
{
"name":["Peter"],
"country":["England"],
"age":["23"]},
{
"name":["Harry"],
"country":["Wales"],
"age":["30"]}]
}}
For example, I want to put the name and the age in a table. I already tried it this way (based on this topic), but it's not working for me.
import json
import pandas as pd
file = open("myfile.json")
data = json.loads(file)
columns = [dct['name', 'age'] for dct in data['response']]
df = pd.DataFrame(data['response'], columns=columns)
print(df)
I have also seen other solutions for reading a JSON file, but they all dealt with files that have no extra header values at the top, like responseHeader in this case. I don't know how to handle that. Can anyone help me out?
import json
with open("myfile.json") as f:
    columns = [(dic["name"], dic["age"]) for dic in json.load(f)["response"]["docs"]]
print(columns)
result:
[(['Peter'], ['23']), (['Harry'], ['30'])]
You can pass the list data["response"]["docs"] to pandas directly as it's a recordset.
df = pd.DataFrame(data["response"]["docs"])
print(df)
      name    country   age
0  [Peter]  [England]  [23]
1  [Harry]    [Wales]  [30]
The data in your DataFrame will be bracketed though, as you can see. If you want to remove the brackets, you can consider the following:
for column in df.columns:
    # take the first element of each single-item list
    df[column] = df[column].str.get(0)
    if column == 'age':
        # assign the column directly so the integer dtype is kept
        df[column] = df[column].astype(int)
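As a minimal end-to-end sketch of the steps above: the question's attempt passes a file object to json.loads, which expects a string, whereas json.load reads from a file object. The example below assumes the file has the structure shown in the question and uses io.StringIO as a stand-in for open("myfile.json") so that it runs on its own.

```python
import io
import json

import pandas as pd

# Stand-in for open("myfile.json"): same structure as the question's file.
raw = """
{"responseHeader": {"status": 0, "time": 2,
                    "params": {"q": "query", "rows": "2", "wt": "json"}},
 "response": {"results": 2, "start": 0, "docs": [
     {"name": ["Peter"], "country": ["England"], "age": ["23"]},
     {"name": ["Harry"], "country": ["Wales"], "age": ["30"]}]}}
"""

with io.StringIO(raw) as f:
    data = json.load(f)  # json.load reads a file object; json.loads expects a string

# Skip responseHeader entirely and keep only the records under response -> docs.
docs = data["response"]["docs"]

# Keep the two wanted fields and unwrap each single-item list.
df = pd.DataFrame(
    {
        "name": [doc["name"][0] for doc in docs],
        "age": [int(doc["age"][0]) for doc in docs],
    }
)
print(df)
```

Indexing with [0] assumes every field holds exactly one value, as in the sample; fields with several values would need a different rule.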
import pandas as pd

sample = {"responseHeader":{
"status":0,
"time":2,
"params":{
"q":"query",
"rows":"2",
"wt":"json"}},
"response":{"results":2,"start":0,"docs":[
{
"name":["Peter"],
"country":["England"],
"age":["23"]},
{
"name":["Harry"],
"country":["Wales"],
"age":["30"]}]
}}
# take the first (only) element of each list-valued field
data = [(x['name'][0], x['age'][0]) for x in sample['response']['docs']]
df = pd.DataFrame(data, columns=['name', 'age'])
I'm trying to get the metadata out of a JSON structure using pandas json_normalize, but it does not work as expected.
I have a JSON file with the following structure:
data=[
{'a':'aa',
'b':{'b1':'bb1','b2':'bb2'},
'c':[{
'ca':[{'ca1':'caa1'
}]
}]
}]
I'd like to get the following:
ca1   a   b.b1
caa1  aa  bb1
I would expect this to work
pd.json_normalize(data, record_path=['c','ca'], meta = ['a',['b','b1']])
but it doesn't find the key b1. Strangely enough if my record_path is 'c' alone it does find the key.
I feel I'm missing something here, but I can't figure out what.
I appreciate any help!
Pass the top-level keys you want to keep as meta, and give record_path as a list of the levels you want to descend through. Column b then still holds a dict: expand it with apply(pd.Series), concat the result back onto df, and drop the original b column.
df = pd.json_normalize(
    data=data,
    meta=['a', 'b'],
    record_path=['c', 'ca']
)
df = pd.concat([df.drop(['b'], axis=1), df['b'].apply(pd.Series)], axis=1)
print(df)
Output:
ca1 a b1 b2
0 caa1 aa bb1 bb2
This is a workaround I used eventually
data=[
{'a':'aa',
'b':{'b1':'bb1','b2':'bb2'},
'c':[{
'ca':[{'ca1':'caa1'
}]
}]
}]
df = pd.json_normalize(data, record_path=['c','ca'], meta=['a',['b']])
df = pd.concat([df,pd.json_normalize(df['b'])],axis = 1)
df.drop(columns='b',inplace = True)
I still think there should be a better way, but it works
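Another option, sketched here under the assumption that the structure is as regular as in the sample, is to skip json_normalize and walk the nesting with a plain comprehension. Each output row picks the leaf value ca1 together with the parent's a and b.b1, so no post-processing of a dict column is needed. The column name "b.b1" is chosen to match the desired output in the question.

```python
import pandas as pd

data = [
    {
        "a": "aa",
        "b": {"b1": "bb1", "b2": "bb2"},
        "c": [{"ca": [{"ca1": "caa1"}]}],
    }
]

# One row per innermost record under c -> ca, carrying the parent's metadata along.
rows = [
    {"ca1": leaf["ca1"], "a": rec["a"], "b.b1": rec["b"]["b1"]}
    for rec in data
    for c_item in rec["c"]
    for leaf in c_item["ca"]
]

df = pd.DataFrame(rows)
print(df)
```

This trades the generality of json_normalize for explicit control over which keys are read at each level.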
So I was using the solution in this post (Split / Explode a column of dictionaries into separate columns with pandas) but nothing changes in my df.
Here is df before code:
number status_timestamps
0 234234 {"created": "2020-11-30T19:44:42Z", "complete"...
1 2342 {"created": "2020-12-14T13:43:48Z", "complete"...
Here is a sample of the dictionary in that column:
{"created": "2020-11-30T19:44:42Z",
"complete": "2021-01-17T14:20:58Z",
"invoiced": "2020-12-16T22:55:02Z",
"confirmed": "2020-11-30T21:16:48Z",
"in_production": "2020-12-11T18:59:26Z",
"invoice_needed": "2020-12-11T22:00:09Z",
"accepted": "2020-12-01T00:00:23Z",
"assets_uploaded": "2020-12-11T17:16:53Z",
"notified": "2020-11-30T21:17:48Z",
"processing": "2020-12-11T18:49:50Z",
"classified": "2020-12-11T18:49:50Z"}
Here is what I tried and df does not change:
df_final = pd.concat([df, df['status_timestamps'].progress_apply(pd.Series)], axis = 1).drop('status_timestamps', axis = 1)
Here is what happens in a notebook:
Please provide a minimal reproducible working example of what you have tried next time.
If I follow the solution in the mentioned post, it works.
This is the code I have used:
import pandas as pd
json_data = {"created": "2020-11-30T19:44:42Z",
"complete": "2021-01-17T14:20:58Z",
"invoiced": "2020-12-16T22:55:02Z",
"confirmed": "2020-11-30T21:16:48Z",
"in_production": "2020-12-11T18:59:26Z",
"invoice_needed": "2020-12-11T22:00:09Z",
"accepted": "2020-12-01T00:00:23Z",
"assets_uploaded": "2020-12-11T17:16:53Z",
"notified": "2020-11-30T21:17:48Z",
"processing": "2020-12-11T18:49:50Z",
"classified": "2020-12-11T18:49:50Z"}
df = pd.DataFrame({"number": 2342, "status_timestamps": [json_data]})
# fastest solution proposed by your reference post
df.join(pd.DataFrame(df.pop('status_timestamps').values.tolist()))
I was able to use another answer from that post, but switched to the safer literal_eval since the original used eval.
Here is working code:
import pandas as pd
from ast import literal_eval
df = pd.read_csv('c:/status_timestamps.csv')
df["status_timestamps"] = df["status_timestamps"].apply(lambda x : dict(literal_eval(x)) )
df2 = df["status_timestamps"].apply(pd.Series )
df_final = pd.concat([df, df2], axis=1).drop('status_timestamps', axis=1)
df_final
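If the column holds JSON text (which is what needing literal_eval suggests, since a CSV stores the dictionaries as strings), json.loads is another way to parse it. The sketch below is an assumption-laden illustration rather than the poster's code: it builds a small DataFrame in memory instead of reading the CSV, keeps only the keys visible in the question, and leaves the second row with just its created value because the rest is elided there.

```python
import json

import pandas as pd

# In-memory stand-in for the CSV: the dictionaries arrive as strings.
df = pd.DataFrame(
    {
        "number": [234234, 2342],
        "status_timestamps": [
            '{"created": "2020-11-30T19:44:42Z", "complete": "2021-01-17T14:20:58Z"}',
            '{"created": "2020-12-14T13:43:48Z"}',
        ],
    }
)

# Parse each string into a dict, then build one column per key.
parsed = df["status_timestamps"].apply(json.loads)
expanded = pd.DataFrame(parsed.tolist(), index=df.index)

# Replace the original column with the expanded ones.
df_final = df.drop(columns="status_timestamps").join(expanded)
print(df_final)
```

Keys missing from a row come out as NaN, which is why the second row has no complete value here.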
My code can get the job done but I know it is not a good way to handle it.
The input is thisdict and the output is shown at the end.
Can you help to make it more efficient?
import pandas as pd
thisdict = {
"A": {'v1':'3','v2':5},
"B": {'v1':'77','v2':99},
"ZZ": {'v1':'311','v2':152}
}
output=pd.DataFrame()
for key, value in thisdict.items():
    # turn value to df
    test2 = pd.DataFrame(value.items(), columns=['item', 'value'])
    test2['id'] = key
    # transpose
    test2 = test2.pivot(index='id', columns='item', values='value')
    # concat
    output = pd.concat([output, test2])
output
You can use:
output = pd.DataFrame.from_dict(thisdict, orient='index')
or
output = pd.DataFrame(thisdict).T
and if you wish, rename the index by:
output.index.rename('id', inplace=True)
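For reference, here is a small self-contained sketch of the from_dict variant applied to the question's thisdict. With orient='index' the outer keys become the row labels and the inner keys become the columns, so no loop, pivot, or concat is needed.

```python
import pandas as pd

thisdict = {
    "A": {"v1": "3", "v2": 5},
    "B": {"v1": "77", "v2": 99},
    "ZZ": {"v1": "311", "v2": 152},
}

# Outer keys -> index, inner keys -> columns.
output = pd.DataFrame.from_dict(thisdict, orient="index")
output.index.name = "id"
print(output)
```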
I have a json file.
[
{
'orderId': 1811,
'deliveryId': '000001811-1634732661563000',
'shippingBook': '[{"qtyOrdered":1,"bookNoList":["B8303-V05","B8304-V05","B8305-V05","B8306-V05","B8307-V05"],"courseCode":"A8399-S26"},{"courseCode":"A1399-S70","qtyOrdered":1,"bookNoList":["B1301-V06","B1302-V06","B1303-V06","B1304-V06","B1305-1-V06","B1305-2-V06","B1306-V06","B1307-V06"]}]',
}
]
But how can I display it in a DataFrame in this format?
Thank you.
The value of 'shippingBook' is a string, so it needs json.loads() to convert it into a Python list of dictionaries.
You can then use ordinary for-loops to collect the data into a plain list of rows, and convert that list to a DataFrame afterwards.
import json
import pandas as pd
data = [
{
'orderId': 1811,
'deliveryId': '000001811-1634732661563000',
'shippingBook': '[{"qtyOrdered":1,"bookNoList":["B8303-V05","B8304-V05","B8305-V05","B8306-V05","B8307-V05"],"courseCode":"A8399-S26"},{"courseCode":"A1399-S70","qtyOrdered":1,"bookNoList":["B1301-V06","B1302-V06","B1303-V06","B1304-V06","B1305-1-V06","B1305-2-V06","B1306-V06","B1307-V06"]}]',
}
]
# --- organize data ---
all_rows = []
for order in data:
    order_id = order['orderId']
    delivery_id = order['deliveryId']
    for book in json.loads(order['shippingBook']):
        row = [order_id, delivery_id, book['courseCode'], book['bookNoList']]
        #print(row)
        all_rows.append(row)
# --- convert to DataFrame ---
df = pd.DataFrame(all_rows, columns=['orderId', 'deliveryId', 'courseCode', 'bookNoList'])
print(df.to_string()) # `to_string()` to display all data without `...`
Result:
orderId deliveryId courseCode bookNoList
0 1811 000001811-1634732661563000 A8399-S26 [B8303-V05, B8304-V05, B8305-V05, B8306-V05, B8307-V05]
1 1811 000001811-1634732661563000 A1399-S70 [B1301-V06, B1302-V06, B1303-V06, B1304-V06, B1305-1-V06, B1305-2-V06, B1306-V06, B1307-V06]
EDIT:
You can also try to do the same directly in the DataFrame.
It needs explode to split the list into rows.
import json
import pandas as pd
data = [
{
'orderId': 1811,
'deliveryId': '000001811-1634732661563000',
'shippingBook': '[{"qtyOrdered":1,"bookNoList":["B8303-V05","B8304-V05","B8305-V05","B8306-V05","B8307-V05"],"courseCode":"A8399-S26"},{"courseCode":"A1399-S70","qtyOrdered":1,"bookNoList":["B1301-V06","B1302-V06","B1303-V06","B1304-V06","B1305-1-V06","B1305-2-V06","B1306-V06","B1307-V06"]}]',
}
]
#df = pd.DataFrame.from_records(data)
df = pd.DataFrame(data)
# convert string to list of dictionaries
df['shippingBook'] = df['shippingBook'].apply(json.loads)
# split list `'shippingBook'` into rows
df = df.explode('shippingBook')
df = df.reset_index(drop=True)
# split elements into columns
#df['courseCode'] = df['shippingBook'].apply(lambda item:item['courseCode'])
#df['bookNoList'] = df['shippingBook'].apply(lambda item:item['bookNoList'])
df['courseCode'] = df['shippingBook'].str['courseCode'] # `.str[...]` also indexes into dictionaries, not only strings
df['bookNoList'] = df['shippingBook'].str['bookNoList'] # same `.str[...]` trick for the book list
# remove `'shippingBook'`
del df['shippingBook']
print(df.to_string())
And the same with apply(pd.Series) to convert the dictionaries into columns.
import json
import pandas as pd
data = [
{
'orderId': 1811,
'deliveryId': '000001811-1634732661563000',
'shippingBook': '[{"qtyOrdered":1,"bookNoList":["B8303-V05","B8304-V05","B8305-V05","B8306-V05","B8307-V05"],"courseCode":"A8399-S26"},{"courseCode":"A1399-S70","qtyOrdered":1,"bookNoList":["B1301-V06","B1302-V06","B1303-V06","B1304-V06","B1305-1-V06","B1305-2-V06","B1306-V06","B1307-V06"]}]',
}
]
#df = pd.DataFrame.from_records(data)
df = pd.DataFrame(data)
# convert string to list of dictionaries
df['shippingBook'] = df['shippingBook'].apply(json.loads)
# split list `'shippingBook'` into rows
df = df.explode('shippingBook')
df = df.reset_index(drop=True)
# split elements into columns
new_columns = df['shippingBook'].apply(pd.Series)
#df[['qtyOrdered', 'bookNoList', 'courseCode']] = new_columns
#del df['qtyOrdered']
#df[['bookNoList', 'courseCode']] = new_columns[['bookNoList', 'courseCode']]
df = df.join(new_columns[['bookNoList', 'courseCode']])
# remove `'shippingBook'`
del df['shippingBook']
print(df.to_string())
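A further variant, offered as a sketch rather than a replacement for the approaches above, is to parse the embedded string first and then let pd.json_normalize do the row expansion: record_path names the list to unroll and meta names the order-level fields to repeat on every row. The book lists are shortened here for readability; the rest of the sample is unchanged.

```python
import json

import pandas as pd

data = [
    {
        "orderId": 1811,
        "deliveryId": "000001811-1634732661563000",
        # book lists shortened for readability
        "shippingBook": '[{"qtyOrdered":1,"bookNoList":["B8303-V05","B8304-V05"],"courseCode":"A8399-S26"},'
                        '{"courseCode":"A1399-S70","qtyOrdered":1,"bookNoList":["B1301-V06","B1302-V06"]}]',
    }
]

# Parse the embedded JSON string without mutating the original records.
records = [{**order, "shippingBook": json.loads(order["shippingBook"])} for order in data]

# One row per shippingBook entry, with the order fields repeated as metadata.
df = pd.json_normalize(records, record_path="shippingBook", meta=["orderId", "deliveryId"])
df = df[["orderId", "deliveryId", "courseCode", "bookNoList"]]
print(df.to_string())
```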
I have some data in a pandas DataFrame, but one of the columns contains multi-line JSON. I am trying to parse that JSON out into a separate DataFrame along with the CustomerId. Here you will see my DataFrame...
df
Out[1]:
Id object
CustomerId object
CallInfo object
Within the CallInfo column, the data looks like this...
[{"CallDate":"2021-06-21","CallLength":362},{"CallDate":"2021-06-24","CallLength":402}]
I want to create a new DataFrame called df_norm which contains the CustomerId, CallDate, and CallLength.
I have tried several ways but couldn't find a working solution. Can anyone help me with this?
Mock up code example...
import pandas as pd
import json
Id = [1, 2, 3]
CustomerId = [700001, 700002, 700003]
CallInfo = ['[{"CallDate":"2021-06-21","CallLength":362},{"CallDate":"2021-06-24","CallLength":402}]', '[{"CallDate":"2021-07-09","CallLength":102}]', '[{"CallDate":"2021-07-11","CallLength":226},{"CallDate":"2021-07-11","CallLength":216}]']
# Reconstruct sample DataFrame
df = pd.DataFrame({
"Id": Id,
"CustomerId": CustomerId,
"CallInfo": CallInfo
})
print(df)
This should work. Create a new list of rows and then toss that into the pd.DataFrame constructor:
new_rows = [{
    'Id': row['Id'],
    'CustomerId': row['CustomerId'],
    'CallDate': item['CallDate'],
    'CallLength': item['CallLength']}
    for _, row in df.iterrows() for item in json.loads(row['CallInfo'])]
new_df = pd.DataFrame(new_rows)
print(new_df)
EDIT: to account for None values in the CallInfo column (such rows are kept once, with empty call fields):
new_rows = []
for _, row in df.iterrows():
    if row['CallInfo'] is not None: # Or additional checks, e.g. == "" or pd.isna(...)
        calls = json.loads(row['CallInfo'])
    else:
        # no call info: emit a single row with empty call fields
        calls = [{'CallDate': None, 'CallLength': None}]
    # one output row per call, so multiple calls are not overwritten
    for item in calls:
        new_rows.append({
            'Id': row['Id'],
            'CustomerId': row['CustomerId'],
            'CallDate': item['CallDate'],
            'CallLength': item['CallLength']})
new_df = pd.DataFrame(new_rows)
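The same result can also be reached without iterrows, sketched here on the mock-up data from the question. The idea is to parse each CallInfo string into a list of dicts, explode that list so each call gets its own row next to its CustomerId, and then expand the dicts into columns. The name df_norm comes from the question; the helper column name call is made up for this example, and the sketch assumes every CallInfo value is a valid JSON array.

```python
import json

import pandas as pd

df = pd.DataFrame(
    {
        "Id": [1, 2, 3],
        "CustomerId": [700001, 700002, 700003],
        "CallInfo": [
            '[{"CallDate":"2021-06-21","CallLength":362},{"CallDate":"2021-06-24","CallLength":402}]',
            '[{"CallDate":"2021-07-09","CallLength":102}]',
            '[{"CallDate":"2021-07-11","CallLength":226},{"CallDate":"2021-07-11","CallLength":216}]',
        ],
    }
)

# Parse each JSON string into a list of dicts, then give every call its own row.
calls = df["CallInfo"].apply(json.loads)
exploded = df[["CustomerId"]].assign(call=calls).explode("call", ignore_index=True)

# Expand the per-call dicts into columns and attach them to the customer ids.
call_columns = pd.DataFrame(exploded["call"].tolist())
df_norm = pd.concat([exploded[["CustomerId"]], call_columns], axis=1)
print(df_norm)
```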