Load a dataframe from a single json object - python

I have the following json object:
{
  "Name": "David",
  "Gender": "M",
  "Date": "2014-01-01",
  "Address": {
    "Street": "429 Ford",
    "City": "Oxford",
    "State": "DE",
    "Zip": 1009
  }
}
How would I load this into a pandas dataframe so that it orients itself as:
name   gender  date        address
David  M       2014-01-01  {...}
What I'm trying now is:
pd.read_json(file)
But it orients it as four records instead of one.

You should read it as a Series and then (optionally) convert to a DataFrame:
df = pd.DataFrame(pd.read_json(file, typ='series')).T
df.shape
#(1, 4)
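If you also want the nested Address keys flattened into their own columns, pd.json_normalize is worth knowing. A minimal sketch, assuming the JSON has first been parsed with the json module (the 'file.json' path is just an example):
import json
import pandas as pd

with open('file.json') as f:   # example path, adjust to your file
    obj = json.load(f)

# One row; nested Address keys become dot-separated columns like 'Address.City'
df = pd.json_normalize(obj)
print(df.shape)  # (1, 7)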

If your JSON file contains one JSON object per line (not an array, not a pretty-printed object),
then you can use:
df = pd.read_json(file, lines=True)
and it will do what you want.
If the file contains:
{"Name": "David","Gender": "M","Date": "2014-01-01","Address": {"Street": "429 Ford","City": "Oxford","State": "DE","Zip": 1009}}
on one line, then you get a single-row DataFrame with one column per top-level key.
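A minimal runnable sketch of that behaviour, reading the one-line record from an in-memory buffer so no file path needs to be assumed:
import io
import pandas as pd

line = '{"Name": "David","Gender": "M","Date": "2014-01-01","Address": {"Street": "429 Ford","City": "Oxford","State": "DE","Zip": 1009}}'
df = pd.read_json(io.StringIO(line), lines=True)
print(df.shape)           # (1, 4)
print(list(df.columns))   # ['Name', 'Gender', 'Date', 'Address']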
If you use
df = pd.read_json(file, orient='records')
you can load one key per column, but the sub-keys will be split across multiple rows.

Related

Compare values in a Json File using Python

I want a Python script that takes a JSON file (file.json) and compares the values of the keys
"From",
"To",
"Source",
"Destination",
"Service"
If all the values are the same, it should display the value of their "ID".
Example:
[
  {
    "ID": "1",
    "Name": "Rule A",
    "From": "SideD SideB",
    "To": "SideA SideC",
    "Source": "rexA rexB",
    "Destination": "proxy gr amz calc",
    "Schedule": "always",
    "Service": "SSH",
    "Action": "ACCEPT"
  },
  {
    "ID": "4",
    "Name": "Rule B",
    "From": "SideA SideC",
    "To": "SideB SideA",
    "Source": "amznA amznB amznC",
    "Destination": "Reseau Lab Optik",
    "Schedule": "always",
    "Service": "Snmp telnet",
    "Action": "ACCEPT"
  },
  {
    "ID": "6",
    "Name": "Rule C",
    "From": "SideD SideA",
    "To": "SideA SideB",
    "Source": "rexB",
    "Destination": "proxy gr",
    "Schedule": "no",
    "Service": "SSH",
    "Action": "ACCEPT"
  }
]
In this example, the script should show "ID": 1 and 6, because the keys "From", "To", "Source" and "Destination" share at least one value.
It should also put them in a CSV file showing the values of the "ID" and all the rest of the keys and values.
import pandas as pd
from pprint import pprint as prt

with open('file.json') as f:
    data = pd.read_json(f)

ids = data["From"]
datas = data[ids.isin(ids[ids.duplicated()])].sort_values("ID")
prt(datas)
IDs only - any one of the 4 columns matches:
You can check each item against each of the four target columns and add matches to a list. Some precautions are needed to make sure repeated matches are not reported; these are explained in the comments.
import pandas as pd

with open('file2.json') as f:
    data = pd.read_json(f)

# Specify columns to check
cols = ["From", "To", "Source", "Destination"]

# Set ID as index for ease of use
data = data.set_index('ID')

# Empty series to store the matches, where the index is ID
matches = pd.Series(index=data.index, dtype=object)

# Go through each item
for item_row_num, item_id in enumerate(data.index):
    # Empty list to store matches with the current item
    item_matching_ids = []
    # Check each column
    for col in cols:
        # If there are any matching IDs with the current item, add them to the list.
        # We check only from the current row onward `data.index[item_row_num:]`, so
        # for example if [1, 6] is detected, it will not be detected again later as [6, 1]
        check_result = data.loc[data.index[item_row_num:], col].duplicated()
        if check_result.any():
            item_matching_ids += list(check_result[check_result].index)
    # Use a set to ensure matching IDs are not repeated
    # (this can happen because multiple columns are checked separately)
    matches.loc[item_id] = set(item_matching_ids)

# Only keep item IDs with at least one match
matches = matches[matches.str.len() > 0]

# Save matches to CSV
matches.to_csv('output.csv')
CSV output:
ID 0
1 {6}
All values - all columns should match:
Since you want to write the values to a CSV file, you can use pandas groupby, where the aggregate function keeps the first occurrence for each column except for ID, where it stores the list of matching entries.
import pandas as pd

with open('file.json') as f:
    data = pd.read_json(f)

# Make a dictionary where keys are column names
# and values are all 'first' except for ID where the value is list
agg_dict = dict.fromkeys(data.columns, 'first')
agg_dict['ID'] = list

# Group rows by the desired columns and apply the aggregation
output = data.groupby(["From", "To", "Source", "Destination", "Service"]).agg(agg_dict)

# Write to CSV file, ignoring the pandas generated index
output.to_csv('output.csv', index=False)
Output CSV file opened in Excel:
IDs only - all columns should match:
You can use pandas groupby, which groups rows according to the given columns, and then get the list of IDs for each group:
import pandas as pd

with open('file.json') as f:
    data = pd.read_json(f)

output = list(
    data.groupby(["From", "To", "Source", "Destination", "Service"])["ID"].agg(list))
Output:
[[4], [1, 6]]
You can further filter the list to items with at least two matches:
output = [ids for ids in output if len(ids)>1]
Output:
[[1, 6]]
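If you also need the CSV to contain the full rows (the IDs plus all the other keys and values) rather than just the IDs, a short sketch using duplicated on the key columns would do it; the output filename here is just an example:
import pandas as pd

data = pd.read_json('file.json')

# Keep every row that shares all five key columns with at least one other row
key_cols = ["From", "To", "Source", "Destination", "Service"]
dupes = data[data.duplicated(subset=key_cols, keep=False)]

# Write the matching rows (ID and every remaining field) to a CSV file
dupes.sort_values("ID").to_csv('matching_rules.csv', index=False)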

Convert http text response to pandas dataframe [duplicate]

This question already has answers here:
Convert Python dict into a dataframe
(18 answers)
JSON to pandas DataFrame
(14 answers)
Closed last year.
I want to convert the text below into a pandas dataframe. Is there a way I can use a pandas pre-built or built-in parser to do the conversion? I could write a custom parsing function, but I want to know if there is a pre-built and/or faster solution.
In this example, the dataframe should result in two rows, one each of ABC & PQR
{
  "data": [
    {
      "ID": "ABC",
      "Col1": "ABC_C1",
      "Col2": "ABC_C2"
    },
    {
      "ID": "PQR",
      "Col1": "PQR_C1",
      "Col2": "PQR_C2"
    }
  ]
}
You've listed everything you need as tags. Use json.loads to get a dict from the string:
import json
import pandas as pd
d = json.loads('''{
"data": [
{
"ID": "ABC",
"Col1": "ABC_C1",
"Col2": "ABC_C2"
},
{
"ID": "PQR",
"Col1": "PQR_C1",
"Col2": "PQR_C2"
}
]
}''')
df = pd.DataFrame(d['data'])
print(df)
Output:
    ID    Col1    Col2
0  ABC  ABC_C1  ABC_C2
1  PQR  PQR_C1  PQR_C2
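As a side note, pd.json_normalize can produce the same frame directly from the parsed dict, and it also flattens any deeper nesting; a sketch reusing the d from above:
import pandas as pd

# Same result as pd.DataFrame(d['data']); nested fields, if any,
# would become dot-separated column names
df = pd.json_normalize(d, record_path='data')
print(df)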

Multi-level Python Dict to Pandas DataFrame only processes one level out of many

I'm parsing some XML data, doing some logic on it, and trying to display the results in an HTML table. The dictionary, after filling, looks like this:
{
  "general_info": {
    "name": "xxx",
    "description": "xxx",
    "language": "xxx",
    "prefix": "xxx",
    "version": "xxx"
  },
  "element_count": {
    "folders": 23,
    "conditions": 72,
    "listeners": 1,
    "outputs": 47
  },
  "external_resource_count": {
    "total": 9,
    "extensions": {
      "jar": 8,
      "json": 1
    },
    "paths": {
      "/lib": 9
    }
  },
  "complexity": {
    "over_1_transition": {
      "number": 4,
      "percentage": 30.769
    },
    "over_1_trigger": {
      "number": 2,
      "percentage": 15.385
    },
    "over_1_output": {
      "number": 4,
      "percentage": 30.769
    }
  }
}
Then I'm using pandas to convert the dictionary into a table, like so:
data_frame = pandas.DataFrame.from_dict(data=extracted_metrics, orient='index').stack().to_frame()
The result is a table that is mostly correct:
While the first and second levels seem to render correctly, those categories with a sub-sub category get written as a string in the cell, rather than as a further column. I've also tried using stack(level=1) but it raises an error "IndexError: Too many levels: Index has only 1 level, not 2". I've also tried making it into a series with no luck. It seems like it only renders "complete" columns. Is there a way of filling up the empty spaces in the dictionary before processing?
How can I get, for example, external_resource_count -> extensions to have two daughter rows jar and json, with an additional column for the values, so that the final table looks like this:
Extra credit if anyone can tell me how to get rid of the first row with the index numbers. Thanks!
The way you load the dataframe is correct, but you should rename the 0 to some proper column name.
# this function extracts all the keys from your nested dicts
def explode_and_filter(df, filterdict):
    return [df[col].apply(lambda x: x.get(k) if type(x) == dict else x).rename(f'{k}')
            for col, nested in filterdict.items()
            for k in nested]

data_frame = pd.DataFrame.from_dict(data=extracted_metrics, orient='index').stack().to_frame(name='somecol')

# let's separate the rows where a dict is present & explode only those rows
mask = data_frame.somecol.apply(lambda x: type(x) == dict)
expp = explode_and_filter(data_frame[mask],
                          {'somecol': ['jar', 'json', '/lib', 'number', 'percentage']})

# here we concat the exploded series into a frame
exploded_df = pd.concat(expp, axis=1).stack().to_frame(name='somecol2')\
                .reset_index(level=2).rename(columns={'level_2': 'somecol'})

# and now we concat the rows with dict elements with the rows with non-dict elements
out = pd.concat([data_frame[~mask], exploded_df])
The output dataframe looks like this
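A shorter alternative worth trying is pd.json_normalize, which flattens every level of the dict into dot-separated column names in one call. A minimal sketch; melting back to a long (metric, value) layout is just one way to approximate the desired table, and it also sidesteps the numeric index:
import pandas as pd

# One row, with columns like 'external_resource_count.extensions.jar'
flat = pd.json_normalize(extracted_metrics)

# Reshape into a two-column (metric, value) table and print without the index
long_form = flat.melt(var_name='metric', value_name='value')
print(long_form.to_string(index=False))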

Required columns are not present in the data frame but column header gets created in the csv and all rows gets populated with null

I am working on getting data from an API using Python. The API returns data as JSON, which is normalised into a data frame and then written to a CSV file.
The API can return any number of columns, and this differs between records. I need only a fixed set of columns, which I define in the code.
In the scenario where a required column is not returned, my code fails.
I need a solution where, even if required columns are not present in the data frame, their headers still get created in the CSV and all rows get populated with null.
required csv structure:
name  address  phone
abc   bcd      1214
bcd   null     null
I'm not sure if I understood you correctly, but I hope the following code solves your problem:
import json
import pandas as pd
# Declare json with missing values:
# - First element doesn't contain "phone" field
# - Second element doesn't contain "married" field
api_data = """
{ "sentences" :
[
{ "name": "abc", "address": "bcd", "married": true},
{ "name": "def", "address": "ghi", "phone" : 7687 }
]
}
"""
json_data = json.loads(api_data)
df = pd.DataFrame(
    data=json_data["sentences"],
    # Explicitly declare which columns should be present in the DataFrame.
    # If the value for a given column is absent it will be populated with NaN
    columns=["name", "address", "married", "phone"]
)
# Save result to csv:
df.to_csv("tmp.csv", index=False)
The content of resulting csv:
name,address,married,phone
abc,bcd,True,
def,ghi,,7687.0
P.S.:
It should work even if columns are absent in all the records. Here is another example:
# Both elements do not contain "married" and "phone" fields
api_data = """
{ "sentences" :
[
{ "name": "abc", "address": "bcd"},
{ "name": "def", "address": "ghi"}
]
}
"""
json_data = json.loads(api_data)
json_data["sentences"][0]
df = pd.DataFrame(
data=json_data["sentences"],
# Explicitly declare which columns should be presented in DataFrame
# If value for given column is absent it will be populated with NaN
columns=["name", "address", "married", "phone"]
)
# Print first rows of DataFrame:
df.head()
# Expected output:
# name address married phone
# 0 abc bcd NaN NaN
# 1 def ghi NaN NaN
df.to_csv("tmp.csv", index=False)
In this case the resulting csv file will contain the following text:
name,address,married,phone
abc,bcd,,
def,ghi,,
The last two commas in the 2nd and 3rd lines mean "an empty/missing value", and if you create a DataFrame from the resulting csv with pd.read_csv then the "married" and "phone" columns will be populated with NaN values.
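If the DataFrame has already been built (for example via pd.json_normalize on the API response), reindex gives the same guarantee after the fact; a small sketch, with the column list simply mirroring the example above:
import pandas as pd

df = pd.json_normalize(json_data["sentences"])

# Add any missing required columns (filled with NaN) and drop unexpected extras
required = ["name", "address", "married", "phone"]
df = df.reindex(columns=required)
df.to_csv("tmp.csv", index=False)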

Why is a string integer read incorrectly with pandas.read_json?

I am not one for hyperbole, but I am really stumped by this behaviour and I am sure you will be too.
Here is a simple json object:
[
  {
    "id": "7012104767417052471",
    "session": -1332751885,
    "transactionId": "515934477",
    "ts": "2019-10-30 12:15:40 AM (+0000)",
    "timestamp": 1572394540564,
    "sku": "1234",
    "price": 39.99,
    "qty": 1,
    "ex": [
      {
        "expId": 1007519,
        "versionId": 100042440,
        "variationId": 100076318,
        "value": 1
      }
    ]
  }
]
Now I saved the file into ex.json and then executed the following python code:
import pandas as pd
df = pd.read_json('ex.json')
When I look at the dataframe, the value of my id has changed from "7012104767417052471" to "7012104767417052160".
Does anyone understand why Python does this? I tried it in Node.js and even Excel, and it looks fine everywhere else.
If I do this I get the right id:
import json
from pandas import json_normalize

with open('Siva.json') as data_file:
    data = json.load(data_file)
df = json_normalize(data)
But I want to understand why pandas doesn't parse this JSON correctly.
This is a known issue; it has been an open issue since 2018-04-04:
read_json reads large integers as strings incorrectly if dtype not explicitly mentioned #20608
As stated in the issue, explicitly designate the dtype to get the correct number.
import pandas as pd
df = pd.read_json('test.json', dtype={'id': 'int64'})
id session transactionId ts timestamp sku price qty ex
7012104767417052471 -1332751885 515934477 2019-10-30 12:15:40 AM (+0000) 2019-10-30 00:15:40.564 1234 39.99 1 [{'expId': 1007519, 'versionId': 100042440, 'variationId': 100076318, 'value': 1}]
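For what it's worth, the rounded value in the question is exactly what you get if the integer takes a detour through a 64-bit float, which is very likely what happens during default parsing; a quick check in plain Python:
# float64 has a 53-bit mantissa, so an integer this large gets rounded
# to the nearest representable value
original = 7012104767417052471
print(int(float(original)))   # 7012104767417052160 -- the value seen in the DataFrame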
