I want a Python script that takes a JSON file (file.json) and compares the values of the keys
"From",
"To",
"Source",
"Destination",
"Service".
If all the values are the same, it will display the value of their "ID".
Example:
[
{
"ID": "1",
"Name": "Rule A",
"From": "SideD SideB",
"To": SideA SideC",
"Source": "rexA rexB",
"Destination": "proxy gr amz calc",
"Schedule": "always",
"Service": "SSH",
"Action": "ACCEPT"
},
{
"ID": "4",
"Name": "Rule B",
"From": "SideA SideC",
"To": "SideB SideA",
"Source": "amznA amznB amznC",
"Destination": "Reseau Lab Optik",
"Schedule": "always",
"Service": "Snmp telnet",
"Action": "ACCEPT"
},
{
"ID": "6",
"Name": "Rule C",
"From": "SideD SideA",
"To": "SideA SideB",
"Source": "rexB",
"Destination": "proxy gr",
"Schedule": "no",
"Service": "SSH",
"Action": "ACCEPT"
}
]
For this situation, the script should show "ID": 1 and 6, because the keys "From", "To", "Source" and "Destination" have at least one matching value.
Also write them to a CSV file showing the value of "ID" and all the rest of the keys and values.
import pandas as pd
from pprint import pprint as prt

with open('file.json') as f:
    data = pd.read_json(f)

ids = data["From"]
datas = data[ids.isin(ids[ids.duplicated()])].sort_values("ID")
prt(datas)
IDs only - any one of the 4 columns matches:
You can check each item against each of the four target columns and add matches to a list. Some precautions are needed to make sure repeated matches are not reported; these are explained in the comments.
import pandas as pd

with open('file2.json') as f:
    data = pd.read_json(f)

# Specify columns to check
cols = ["From", "To", "Source", "Destination"]
# Set ID as index for ease of use
data = data.set_index('ID')
# Empty series to store the matches, where the index is ID
matches = pd.Series(index=data.index, dtype=object)
# Go through each item
for item_row_num, item_id in enumerate(data.index):
    # Empty list to store matches with the current item
    item_matching_ids = []
    # Check each column
    for col in cols:
        # Compare the current item's value only against the later rows
        # `data.index[item_row_num + 1:]`, so if for example [1, 6] is
        # detected, it will not be detected again later as [6, 1]
        later_rows = data.index[item_row_num + 1:]
        match_mask = data.loc[later_rows, col] == data.loc[item_id, col]
        if match_mask.any():
            item_matching_ids += list(match_mask[match_mask].index)
    # Use a set to ensure matching IDs are not repeated
    # This can happen because multiple columns are checked separately
    matches.loc[item_id] = set(item_matching_ids)

# Only keep item IDs with at least one match
matches = matches[matches.apply(len) > 0]
# Save matches to CSV
matches.to_csv('output.csv')
CSV output:
ID 0
1 {6}
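For a quick sanity check without pandas, the same "any one of the columns matches" pairing can be sketched with itertools.combinations. The rule data here is an inlined, simplified stand-in, not the exact contents of file.json:

```python
from itertools import combinations

# Simplified inline stand-in for the rules loaded from file.json
rules = [
    {"ID": "1", "From": "A", "To": "B", "Source": "s1", "Destination": "d1"},
    {"ID": "4", "From": "C", "To": "D", "Source": "s2", "Destination": "d2"},
    {"ID": "6", "From": "A", "To": "E", "Source": "s3", "Destination": "d1"},
]
cols = ["From", "To", "Source", "Destination"]

# Every unordered pair of rules that agrees on at least one of the columns
pairs = [(a["ID"], b["ID"])
         for a, b in combinations(rules, 2)
         if any(a[c] == b[c] for c in cols)]
print(pairs)  # [('1', '6')] -- rules 1 and 6 share "From" and "Destination"
```

Because combinations yields each unordered pair once, [6, 1] can never be reported after [1, 6], which is the same precaution the pandas loop takes.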
All values - all columns should match:
Since you want to write the values to a CSV file, you can use pandas groupby, where the aggregate function keeps the first occurrence for each column except for ID, where it stores the list of matching entries.
import pandas as pd

with open('file.json') as f:
    data = pd.read_json(f)

# Make a dictionary where keys are column names
# and values are all 'first' except for ID where the value is list
agg_dict = dict.fromkeys(data.columns, 'first')
agg_dict['ID'] = list
# Group rows by the desired columns and apply the aggregation
output = data.groupby(["From", "To", "Source", "Destination", "Service"]).agg(agg_dict)
# Write to CSV file, ignoring the pandas-generated index
output.to_csv('output.csv', index=False)
Output CSV file opened in Excel:
IDs only - all columns should match:
You can use pandas groupby which groups rows according to given columns and then get list of the IDs of grouped rows:
import pandas as pd

with open('file.json') as f:
    data = pd.read_json(f)

output = list(
    data.groupby(["From", "To", "Source", "Destination", "Service"])["ID"].agg(list))
Output:
[[4], [1, 6]]
You can further filter the list down to items with at least two matching IDs:
output = [ids for ids in output if len(ids)>1]
Output:
[[1, 6]]
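The same all-columns grouping can also be done with a plain dict keyed on the tuple of values, in case you ever need it without pandas (again with inlined, simplified stand-in data):

```python
from collections import defaultdict

# Simplified stand-in for the parsed JSON rules
rules = [
    {"ID": "1", "From": "A", "To": "B", "Source": "s", "Destination": "d", "Service": "SSH"},
    {"ID": "4", "From": "C", "To": "D", "Source": "t", "Destination": "e", "Service": "Snmp"},
    {"ID": "6", "From": "A", "To": "B", "Source": "s", "Destination": "d", "Service": "SSH"},
]

groups = defaultdict(list)
for r in rules:
    # Rows that agree on all five values land under the same key
    key = (r["From"], r["To"], r["Source"], r["Destination"], r["Service"])
    groups[key].append(r["ID"])

# Keep only groups where at least two rules share all five values
dupes = [ids for ids in groups.values() if len(ids) > 1]
print(dupes)  # [['1', '6']]
```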
Related
I'm parsing some XML data, doing some logic on it, and trying to display the results in an HTML table. The dictionary, after filling, looks like this:
{
"general_info": {
"name": "xxx",
"description": "xxx",
"language": "xxx",
"prefix": "xxx",
"version": "xxx"
},
"element_count": {
"folders": 23,
"conditions": 72,
"listeners": 1,
"outputs": 47
},
"external_resource_count": {
"total": 9,
"extensions": {
"jar": 8,
"json": 1
},
"paths": {
"/lib": 9
}
},
"complexity": {
"over_1_transition": {
"number": 4,
"percentage": 30.769
},
"over_1_trigger": {
"number": 2,
"percentage": 15.385
},
"over_1_output": {
"number": 4,
"percentage": 30.769
}
}
}
Then I'm using pandas to convert the dictionary into a table, like so:
data_frame = pandas.DataFrame.from_dict(data=extracted_metrics, orient='index').stack().to_frame()
The result is a table that is mostly correct:
While the first and second levels seem to render correctly, those categories with a sub-sub category get written as a string in the cell, rather than as a further column. I've also tried using stack(level=1) but it raises an error "IndexError: Too many levels: Index has only 1 level, not 2". I've also tried making it into a series with no luck. It seems like it only renders "complete" columns. Is there a way of filling up the empty spaces in the dictionary before processing?
How can I get, for example, external_resource_count -> extensions to have two daughter rows jar and json, with an additional column for the values, so that the final table looks like this:
Extra credit if anyone can tell me how to get rid of the first row with the index numbers. Thanks!
The way you load the dataframe is correct, but you should rename the 0 column to some meaningful name.
import pandas as pd

# this function extracts all the keys from your nested dicts
def explode_and_filter(df, filterdict):
    return [df[col].apply(lambda x: x.get(k) if type(x) == dict else x).rename(f'{k}')
            for col, nested in filterdict.items()
            for k in nested]

data_frame = pd.DataFrame.from_dict(data=extracted_metrics, orient='index').stack().to_frame(name='somecol')
# let's separate the rows where a dict is present & explode only those rows
mask = data_frame.somecol.apply(lambda x: type(x) == dict)
expp = explode_and_filter(data_frame[mask],
                          {'somecol': ['jar', 'json', '/lib', 'number', 'percentage']})
# here we concat the exploded series into a frame
exploded_df = pd.concat(expp, axis=1).stack().to_frame(name='somecol2') \
    .reset_index(level=2).rename(columns={'level_2': 'somecol'})
# and now we concat the rows with dict elements with the rows with non-dict elements
out = pd.concat([data_frame[~mask], exploded_df])
The output dataframe looks like this
I am looping through each row in an excel sheet using the openpyxl import to ultimately build a large Json string that i can feed to an API.
I am looping through each row and building out my JSON structure. I need to split a cell value by " || " and then add each resulting value to a nested array inside a JSON section.
My problem is that I build my list object in my for loop, append the JSON chunk to a larger array, and it keeps appending my list values during each loop. So I used the .clear() method on the list to clear it after each loop... but then when I compile my final output my list is empty. It's like it does not maintain its values after it is added to the list each loop. It's almost like each loop needs its own unique array to use and keep the values: the tags section of the JSON is emptied in the final output for each JSON line, when it should have the values for each unique iteration in it. I am new to Python and gave it a good whirl; any suggestions in the right direction would be appreciated.
My data set (I have 3 rows in Excel). You can see that I have values I want to split in the 7th column. That is the column I am looping through to split the values, as they will be nested in my JSON.
Row 1 (cells) = "ABC","Testing","Testing Again","DATE","DATE",Empty,"A || B || C".
Row 2 (cells) = "ABC 2","Testing 2","Testing Again 2","DATE","DATE",Empty,"X || Y || Z".
Row 3 (cells) = "ABC 3","Testing 3","Testing Again 3","DATE","DATE",Empty,Empty.
My Code.
#from openpyxl import Workbook
import json
from openpyxl import load_workbook

output_table = input_table.copy()
var_path_excel_file = flow_variables['Location']
workbook = load_workbook(filename=var_path_excel_file)
sheet = workbook.active

#create a null value to be used
emptyString = "Null"
#list out all of the sections of the json that we want to print out - these are based on the inputs
jsonFull = []
jsondata = {}
tags = []

for value in sheet.iter_rows(min_row=2, min_col=0, max_col=40, values_only=True):
    #I add my split values to an array so that when I add the array to the json it will have the proper brackets I need for the API to run correctly
    if value[6] is not None:
        data = value[6].split(" || ")
        for temp in data:
            tags.append(temp)
    #I build out the json structure here that will be added for each excel row basically
    jsondata = {
        "name": value[0],
        "short_description": value[1],
        "long_description": value[2],
        "effective_start_date": value[3],
        "effective_end_date": value[4],
        "workflow_state": emptyString,
        "tags": tags
    }
    #Add the jsondata row to the larger collection
    jsonFull.append(jsondata)
    tags.clear()

print(json.dumps(jsonFull))
And then my desired outcome would be something like this. I just need to figure out the proper syntax for the list handling...and can't seem to find an example to base off of.
[
{
"name": "ABC",
"short_description": "Testing",
"long_description": "Testing Again",
"effective_start_date": "2020-03-04T14:45:22Z",
"effective_end_date": "2020-03-04T14:45:22Z",
"workflow_state": "Null",
"tags": [
"A",
"B",
"C"
]
},
{
"name": "ABC 2",
"short_description": "Testing 2",
"long_description": "Testing Again 2",
"effective_start_date": "2020-03-04T14:45:22Z",
"effective_end_date": "2020-03-04T14:45:22Z",
"workflow_state": "Null",
"tags": [
"X",
"Y",
"Z"
]
},
{
"name": "ABC 3",
"short_description": "Testing 3",
"long_description": "Testing Again 3",
"effective_start_date": "2020-03-04T14:45:22Z",
"effective_end_date": "2020-03-04T14:45:22Z",
"workflow_state": "Null",
"tags": [
]
}
]
You're not making a copy of tags when you put it into the dictionary or call tags.clear(), you're just putting a reference to the same list. You need to create a new list at the beginning of each loop iteration, not reuse the same list.
for value in sheet.iter_rows(min_row=2, min_col=0, max_col=40, values_only=True):
    #I add my split values to an array so that when I add the array to the json it will have the proper brackets I need for the API to run correctly
    if value[6] is not None:
        tags = value[6].split(" || ")
    else:
        tags = []
    #I build out the json structure here that will be added for each excel row basically
    jsondata = {
        "name": value[0],
        "short_description": value[1],
        "long_description": value[2],
        "effective_start_date": value[3],
        "effective_end_date": value[4],
        "workflow_state": emptyString,
        "tags": tags
    }
    #Add the jsondata row to the larger collection
    jsonFull.append(jsondata)
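A minimal demonstration of the reference semantics behind the bug, independent of openpyxl:

```python
# Appending a list to a structure stores a reference, not a copy
tags = ["A", "B"]
rows = [{"tags": tags}]
tags.clear()                      # this also empties rows[0]["tags"]
assert rows[0]["tags"] == []

# Binding a fresh list on every iteration keeps each row independent
rows = []
for raw in ("A || B", "X || Y"):
    tags = raw.split(" || ")      # new list object each time
    rows.append({"tags": tags})
assert rows[0]["tags"] == ["A", "B"]
assert rows[1]["tags"] == ["X", "Y"]
```

str.split already returns a brand-new list, which is why the fixed loop above never needs .clear() or an explicit copy.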
I am working on getting data from an API using Python. The API returns data as JSON, which is normalised and written to a data frame, which is then written to a CSV file.
The API can return any number of columns, which differs between records. I need only a fixed number of columns, which I define in the code.
In the scenario where a required column is not returned, my code fails.
I need a solution where, even when required columns are not present in the data frame, the column header gets created in the CSV and all rows get populated with null.
required csv structure:
name  address  phone
abc   bcd      1214
bcd   null     null
I'm not sure if I understood you correctly, but I hope the following code solves your problem:
import json
import pandas as pd
# Declare json with missing values:
# - First element doesn't contain "phone" field
# - Second element doesn't contain "married" field
api_data = """
{ "sentences" :
[
{ "name": "abc", "address": "bcd", "married": true},
{ "name": "def", "address": "ghi", "phone" : 7687 }
]
}
"""
json_data = json.loads(api_data)
df = pd.DataFrame(
data=json_data["sentences"],
# Explicitly declare which columns should be presented in DataFrame
# If value for given column is absent it will be populated with NaN
columns=["name", "address", "married", "phone"]
)
# Save result to csv:
df.to_csv("tmp.csv", index=False)
The content of resulting csv:
name,address,married,phone
abc,bcd,True,
def,ghi,,7687.0
P.S.:
It should work even if columns are absent in all the records. Here is another example:
# Both elements do not contain "married" and "phone" fields
api_data = """
{ "sentences" :
[
{ "name": "abc", "address": "bcd"},
{ "name": "def", "address": "ghi"}
]
}
"""
json_data = json.loads(api_data)
json_data["sentences"][0]
df = pd.DataFrame(
data=json_data["sentences"],
# Explicitly declare which columns should be presented in DataFrame
# If value for given column is absent it will be populated with NaN
columns=["name", "address", "married", "phone"]
)
# Print first rows of DataFrame:
df.head()
# Expected output:
# name address married phone
# 0 abc bcd NaN NaN
# 1 def ghi NaN NaN
df.to_csv("tmp.csv", index=False)
In this case the resulting csv file will contain the following text:
name,address,married,phone
abc,bcd,,
def,ghi,,
The last two commas in the 2nd and 3rd lines mean "an empty/missing value", and if you create a DataFrame from the resulting csv with pd.read_csv, then the "married" and "phone" columns will be populated with NaN values.
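To confirm the round trip, you can read the generated text back with pd.read_csv; io.StringIO stands in for the file here so the sketch is self-contained:

```python
import io

import pandas as pd

# Same text as the all-columns-absent CSV above
csv_text = "name,address,married,phone\nabc,bcd,,\ndef,ghi,,\n"
df = pd.read_csv(io.StringIO(csv_text))

# The empty fields come back as NaN in both declared-but-absent columns
print(df["married"].isna().all(), df["phone"].isna().all())  # True True
```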
I have the following json object:
{
"Name": "David",
"Gender": "M",
"Date": "2014-01-01",
"Address": {
"Street": "429 Ford",
"City": "Oxford",
"State": "DE",
"Zip": 1009
}
}
How would I load this into a pandas dataframe so that it orients itself as:
name   gender  date        address
David  M       2014-01-01  {...}
What I'm trying now is:
pd.read_json(file)
But it orients it as four records instead of one.
You should read it as a Series and then (optionally) convert to a DataFrame:
df = pd.DataFrame(pd.read_json(file, typ='series')).T
df.shape
#(1, 4)
if your JSON file is composed of 1 JSON object per line (not an array, not a pretty printed JSON object)
then you can use:
df = pd.read_json(file, lines=True)
and it will do what you want
if file contains:
{"Name": "David","Gender": "M","Date": "2014-01-01","Address": {"Street": "429 Ford","City": "Oxford","State": "DE","Zip": 1009}}
on 1 line, then you get:
If you use
df = pd.read_json(file, orient='records')
you can load as 1 key per column, but the sub-keys will be split up into multiple rows.
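If, instead of keeping the whole Address dict in one cell, you ever want its sub-keys as their own columns, pd.json_normalize will flatten them. A small sketch using the same object:

```python
import pandas as pd

obj = {"Name": "David", "Gender": "M", "Date": "2014-01-01",
       "Address": {"Street": "429 Ford", "City": "Oxford",
                   "State": "DE", "Zip": 1009}}

# One row; nested keys become dotted column names like "Address.Street"
df = pd.json_normalize(obj)
print(df.shape)  # (1, 7)
```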
I have a very large JSON file with multiple individual JSON objects in the format shown below. I am trying to convert it to a CSV so that each row is a combination of the outer id/name/alphabet of a JSON object and one set of conversion id/name/alphabet. This is repeated for all the sets of id/name/alphabet within an individual JSON object. So from the object below, 2 rows should be created: the first row is the (outer) id/name/alphabet plus the 1st id/name/alphabet of conversion; the second row is again the (outer) id/name/alphabet, now with the 2nd id/name/alphabet of conversion.
An important note is that certain objects in the file can have upwards of 50/60 conversion id/name/alphabet pairs.
What I tried so far was to flatten the JSON objects first, which resulted in keys like conversion_id_0 and conversion_id_1 etc., so I can map the outer keys as they are always constant, but I am unsure how to map each corresponding numbered set to a separate row.
Any help or insight would be greatly appreciated!
[
{
"alphabet": "ABCDEFGHIJKL",
"conversion": [
{
"alphabet": "BCDEFGHIJKL",
"id": 18589260,
"name": [
"yy"
]
},
{
"alphabet": "EFGHIJEFGHIJ",
"id": 18056632,
"name": [
"zx",
"cd"
]
}
],
"id": 23929934,
"name": [
"x",
"y"
]
}
]
Your question is unclear about the exact mapping from input JSON data to rows of the CSV file, so I had to guess what should happen when there's more than one "name" associated with an inner or outer object.
Regardless, hopefully the following will give you a general idea of how to solve such problems.
import csv
objects = [
{
"alphabet": "ABCDEFGHIJKL",
"id": 23929934,
"name": [
"x",
"y"
],
"conversion": [
{
"alphabet": "BCDEFGHIJKL",
"id": 18589260,
"name": [
"yy"
]
},
{
"alphabet": "EFGHIJEFGHIJ",
"id": 18056632,
"name": [
"zx",
"cd"
]
}
],
}
]
with open('converted_json.csv', 'w', newline='') as outfile:
    def group(item):
        return [item["id"], item["alphabet"], ' '.join(item["name"])]

    writer = csv.writer(outfile, quoting=csv.QUOTE_NONNUMERIC)
    for obj in objects:
        outer = group(obj)
        for conversion in obj["conversion"]:
            inner = group(conversion)
            writer.writerow(outer + inner)
Contents of the CSV file generated:
23929934,"ABCDEFGHIJKL","x y",18589260,"BCDEFGHIJKL","yy"
23929934,"ABCDEFGHIJKL","x y",18056632,"EFGHIJEFGHIJ","zx cd"