How do I parse multiple levels (arrays) within a JSON file using Python? - python

Here is the sample JSON that I am trying to parse in Python.
I am having a hard time parsing through "files".
Any help is appreciated.
{
"startDate": "2016-02-19T08:19:30.764-0700",
"endDate": "2016-02-19T08:20:19.058-07:00",
"files": [
{
"createdOn": "2017-02-19T08:20:19.391-0700",
"contentType": "text/plain",
"type": "jpeg",
"id": "Photoshop",
"fileUri": "output.log"
}
],
"status": "OK",
"linkToken": "-3418849688029673541",
"subscriberViewers": [
"null"
]
}

To print the id of each file in the array:
import json
# rawdata is the JSON text shown above, held in a string
data = json.loads(rawdata)
files = data['files']
for f in files:
    print(f['id'])
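For a version that runs on its own, here is a minimal sketch of the same idea using a trimmed copy of the sample. The rawdata string is an assumption standing in for however the JSON text is actually obtained (a file, an HTTP response, and so on).

```python
import json

# Trimmed copy of the sample from the question; in practice rawdata
# would come from a file or a response body (assumption).
rawdata = """
{
  "startDate": "2016-02-19T08:19:30.764-0700",
  "files": [
    {"contentType": "text/plain", "type": "jpeg", "id": "Photoshop", "fileUri": "output.log"}
  ],
  "status": "OK"
}
"""

data = json.loads(rawdata)

# "files" is a list of dicts, so iterate over the list and index each dict by key.
# data.get("files", []) avoids a KeyError when the key is missing.
file_ids = [f["id"] for f in data.get("files", [])]
for file_id in file_ids:
    print(file_id)
```

Each extra level of nesting is handled the same way: a list is iterated, a dict is indexed by key.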

Related

Looking to generically convert JSON file to CSV in Python

I tried the solution shared in the linked question, "Nested json to csv - generic approach".
It worked for Sample 1, but gives only a single row for Sample 2.
Is there a way to write generic Python code that handles both Sample 1 and Sample 2?
Sample 1 ::
{
"Response": "Success",
"Message": "",
"HasWarning": false,
"Type": 100,
"RateLimit": {},
"Data": {
"Aggregated": false,
"TimeFrom": 1234567800,
"TimeTo": 1234567900,
"Data": [
{
"id": 11,
"symbol": "AAA",
"time": 1234567800,
"block_time": 123.282828282828,
"block_size": 1212121,
"current_supply": 10101010
},
{
"id": 12,
"symbol": "BBB",
"time": 1234567900,
"block_time": 234.696969696969,
"block_size": 1313131,
"current_supply": 20202020
}
]
}
}
Sample 2::
{
"Response": "Success",
"Message": "Summary succesfully returned!",
"Data": {
"11": {
"Id": "3333",
"Url": "test/11.png",
"value": "11",
"Name": "11 entries (11)"
},
"122": {
"Id": "5555555",
"Url": "test/122.png",
"Symbol": "122",
"Name": "122 cases (122)"
}
},
"Limit": {},
"HasWarning": False,
"Type": 50
}
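Try this. You need to install the flatten_json package first (for example with pip install flatten_json).
import sys
import csv
import json
from flatten_json import flatten
# Load the JSON file named on the command line
with open(sys.argv[1]) as src:
    data = json.load(src)
# Collapse nested keys into one flat dict, e.g. Data_11_Id
data = flatten(data)
# newline='' stops the csv module from writing blank lines on Windows
with open('foo.csv', 'w', newline='') as f:
    out = csv.DictWriter(f, data.keys())
    out.writeheader()
    out.writerow(data)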
Output
> cat foo.csv
Response,Message,Data_11_Id,Data_11_Url,Data_11_value,Data_11_Name,Data_122_Id,Data_122_Url,Data_122_Symbol,Data_122_Name,Limit,HasWarning,Type
Success,Summary succesfully returned!,3333,test/11.png,11,11 entries (11),5555555,test/122.png,122,122 cases (122),{},False,50
Note: False is not valid JSON; you need to change it to false.
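Flattening the whole document always yields a single row. If one row per record is wanted for both samples, one option is to locate the record collection first and flatten each record separately. The sketch below is only a heuristic under stated assumptions: find_records, flatten_dict, to_csv and the _key column are hypothetical names, and the records are assumed to be either a list of dicts (Sample 1) or a dict of dicts keyed by id (Sample 2). It uses only the standard library rather than flatten_json.

```python
import csv
import io

def flatten_dict(obj, prefix=""):
    """Flatten nested dicts into one dict with underscore-joined keys."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}_{key}" if prefix else str(key)
        if isinstance(value, dict) and value:
            flat.update(flatten_dict(value, name))
        else:
            flat[name] = value
    return flat

def find_records(node):
    """Return the first list of dicts, or dict of dicts, found under node."""
    if isinstance(node, list) and node and all(isinstance(x, dict) for x in node):
        return node
    if isinstance(node, dict):
        values = list(node.values())
        if values and all(isinstance(v, dict) and v for v in values):
            # Dict keyed by id: keep the key in a hypothetical "_key" column.
            return [{"_key": k, **v} for k, v in node.items()]
        for value in values:
            found = find_records(value)
            if found:
                return found
    return []

def to_csv(doc):
    """Write one CSV row per record and return the CSV text."""
    rows = [flatten_dict(r) for r in find_records(doc)]
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

# Shortened versions of Sample 1 and Sample 2.
sample1 = {"Response": "Success", "RateLimit": {},
           "Data": {"Aggregated": False,
                    "Data": [{"id": 11, "symbol": "AAA"}, {"id": 12, "symbol": "BBB"}]}}
sample2 = {"Response": "Success",
           "Data": {"11": {"Id": "3333", "value": "11"},
                    "122": {"Id": "5555555", "Symbol": "122"}},
           "Limit": {}}

print(to_csv(sample1))
print(to_csv(sample2))
```

Records with different keys (value vs. Symbol in Sample 2) end up with empty cells in the columns they lack.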

Using Pandas to convert csv to Json

I want to convert a CSV file to JSON using pandas. I am a tester and want to send some events to Event Hub; for that, I want to maintain a CSV file and update my records/data through it. For reference, I created the CSV file by reading a JSON file with pandas. Now, when I convert the CSV back into JSON using pandas, the data is not displayed in the correct format. Can you please help?
Step 1: Converted JSON to CSV using pandas:
df = pd.read_json('C://Users//DAMALI//Desktop/test.json')
df.to_csv('C://Users//DAMALI//Desktop/test.csv')
Step 2: Now if I try to convert the CSV back to JSON, it does not come out in the same format as the original:
df = pd.read_csv('C://Users//DAMALI//Desktop/test.csv')
df.to_json('C://Users//DAMALI//Desktop/test1.json')
Providing JSON below:
{
"body": {
"deviceId": "UDM",
"registrationDate": "12/11/2019",
"testRegistration": false,
"serialNumber": "25",
"articleNumber": "R91",
"deviceName": "UDM-test",
"locationId": "lc0",
"sapSoldToId": "1138474",
"crmDomainAccountId": "1234566",
"crmAccountDetails": {
"accountName": "ProjectX",
"accountId": "Instal",
"region": "AP"
},
"productLine": "UD",
"state": "registered",
"installerName": "ABC Rooms",
"installationAddress": {
"street": "Benelu",
"zipCode": "850",
"city": "Kortr",
"state": "OVL",
"country": "Belgi"
},
"customerDetails": {
"name": "John D",
"contactName": "John Doe",
"phone": "+32 999999999",
"email": "john.doe#test.com"
},
"wallConnect": {
"wallSize": "Width 5 x Height 4",
"wallOrientation": "LANDSCAPE",
"displayType": "BVD-D55M21H321A1C300",
"softwareVersion": "1.13.1.1.3"
},
"projector": {
"name": "UDX 40K-123456789",
"subType": "UDX 40K"
},
"featureLicense": ["UDX-aa00213a-5719-440e-a3b5", "UDX-aa00a-571"],
"cloudServiceLicense": ["EN04d5-4d2a-9131-875ad37c5883", "E15-4d2a-9131-875ad37c5154"],
"metadata": {
"cusQuesAns": [{
"ques": "End ucal industry",
"ans": "Hosity",
"key": "CUST_ANSWER"
},
{
"ques": "End user video wall application",
"ans": "Simulation & Virtual Reality",
"key": "CUSSECOND_ANSWER"
}
]
},
"frequency": "realtime",
"subDevices": [{
"deviceType": "DISPLAY",
"serialNumber": "68960",
"articleNumber": "R792",
"wallConnect": {
"displayFMWVersion": "3.0.0",
"displayVariant": "KVD21H331A1C300"
}
}]
},
"properties": {
"drs": {
"type": "salesforce-lm"
}
},
"systemProperties": {
"user-id": "data-cvice",
"message-id": "1b1012cc-9b18c192"
}
}
Try this for converting CSV to JSON:
import pandas as pd
df = pd.read_csv(r'Fayzan-Bhatti\test.csv')
df.to_json(r'Fayzan-Bhatti\new_test.json')
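A CSV file is flat, so nested objects such as "body" or "crmAccountDetails" cannot survive a plain round trip unchanged. As a rough illustration of one way around that, here is a standard-library sketch (not pandas) that stores each leaf under a dotted column name and JSON-encodes the cell, so types and lists can be restored. The flatten and unflatten helpers are hypothetical, and the sketch assumes that no key contains a dot.

```python
import csv
import io
import json

def flatten(obj, prefix=""):
    """Map nested dicts to {"a.b.c": json_text}; lists are kept as JSON text."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict) and value:
            flat.update(flatten(value, name))
        else:
            flat[name] = json.dumps(value)
    return flat

def unflatten(flat):
    """Rebuild the nested dict from dotted column names."""
    nested = {}
    for dotted, cell in flat.items():
        parts = dotted.split(".")
        target = nested
        for part in parts[:-1]:
            target = target.setdefault(part, {})
        target[parts[-1]] = json.loads(cell)
    return nested

# Shortened version of the event from the question.
event = {
    "body": {
        "deviceId": "UDM",
        "testRegistration": False,
        "crmAccountDetails": {"accountName": "ProjectX", "region": "AP"},
        "featureLicense": ["UDX-aa00213a", "UDX-aa00a-571"],
    },
    "properties": {"drs": {"type": "salesforce-lm"}},
}

# JSON -> CSV (one row per event)
row = flatten(event)
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)

# CSV -> JSON
buf.seek(0)
restored = [unflatten(r) for r in csv.DictReader(buf)]
print(json.dumps(restored[0], indent=2))
```

The cells are JSON text, so strings appear quoted in the CSV; that is the price of getting booleans, numbers and lists back with their original types.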

Using python to copy a jsonl.gz file from S3 to ABS and unzip along the way

I am using a Databricks notebook to copy a jsonl.gz file from S3 to ABS (my ABS container is already mounted), and need the file to be unzipped at the end of the process. The filenames will be fed into the notebook using the 'directory' and 'fileun' variables. An example filename is 'folder-date/0000-00-0000.jsonl.gz'.
I am having difficulty figuring out the exact syntax for this. Currently I am getting stuck on trying to read the jsonl.gz file into a dataframe. The error I get is "Invalid file path or buffer object type: <class 'dict'>". Here is what I have so far, any help is appreciated:
fileun = dbutils.widgets.get("fileun")
directory = dbutils.widgets.get("directory")
file = fileun[:-3]
file_path=directory+fileun
import pandas as pd
import numpy as np
import boto3
import pyodbc
import gzip
import shutil
client = boto3.client(
    "s3",
    region_name='region',
    aws_access_key_id='key',
    aws_secret_access_key='key'
)
response=client.get_object(
    Bucket='bucket_name',
    Key=file_path
)
# This is the line that raises the error quoted above
df = pd.read_json(response, compression='infer')
with gzip.open(response, 'rb') as f_in:
    with open(file, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
writefile = open('container_name' % (fileun), 'w')
writefile.write(df)
writefile.close()
Here is a snippet from one of the files:
{
"uid": "9a926d799f437b0d279c144dec2bcef7cd16db341bca6e7653246d960331d00a",
"doc": {
"snippet": {
"authorProfileImageUrl": null,
"textDisplay": "#Keanu Corps  Same \"reasoning\" as the democrats. Even though inflation is getting worse.",
"publishedAt": "2021-10-28T09:34:15.334995+0000",
"authorChannelUrl": "/channel/UCsxSW7_bBsbFAjkh7ujctRA",
"authorChannelId": {
"value": "UCsxSW7_bBsbFAjkh7ujctRA"
},
"likeCount": 0,
"videoId": "U2R_srS4TR4",
"authorDisplayName": "The Video Game Hunger 01"
},
"crawlid": "-",
"kind": "youtube#comment",
"correlation_id": "195c9442-74d8-5003-baa3-7f1d05ef5aa6",
"id": "UgznKXgYDDWUEjX0YaZ4AaABAg.9TvZiuGoiMM9U1jbFAWCTn",
"parentId": "UgznKXgYDDWUEjX0YaZ4AaABAg",
"is_reply": true,
"timestamp": "2021-10-28T18:34:15.437543"
},
"system_timestamp": "2021-10-28T18:34:15.944581+00:00",
"norm_attribs": {
"website": "github.com/-",
"type": "youtube",
"version": "1.0"
},
"type": "youtube_comment",
"norm": {
"author": "The Video Game Hunger 01",
"domain": "youtube.com",
"id": "UgznKXgYDDWUEjX0YaZ4AaABAg.9TvZiuGoiMM9U1jbFAWCTn",
"body": "#Keanu Corps  Same \"reasoning\" as the democrats. Even though inflation is getting worse.",
"author_id": "UCsxSW7_bBsbFAjkh7ujctRA",
"url": "https://www.youtube.com/watch?v=U2R_srS4TR4&lc=UgznKXgYDDWUEjX0YaZ4AaABAg.9TvZiuGoiMM9U1jbFAWCTn",
"timestamp": "2021-10-28T09:34:15.334995+00:00"
},
"organization_id": "-",
"sub_organization_id": "default",
"campaign_id": "-",
"project_id": "default",
"project_version_id": "default",
"meta": {
"relates_to_timestamp": [
{
"results": [
"2021-10-28T09:34:15.334995+00:00"
],
"attribs": {
"website": "github.com/-",
"source": "Explicit",
"type": "Timestamp Extractor",
"version": "1.0"
}
}
],
"post_type": [
{
"results": [
"post"
],
"attribs": {
"website": "github.com/-",
"source": "Explicit",
"type": "Post Type Extractor",
"version": "1.0"
}
}
],
"relates_to": [
{
"results": [
"U2R_srS4TR4"
],
"attribs": {
"website": "github.com/-",
"source": "Explicit",
"type": "String Extractor",
"version": "1.0"
}
}
],
"author_name": [
{
"results": [
"The Video Game Hunger 01"
],
"attribs": {
"website": "github.com/-",
"source": "Explicit",
"type": "String Extractor",
"version": "1.0"
}
}
],
"author_id": [
{
"results": [
"UCsxSW7_bBsbFAjkh7ujctRA"
],
"attribs": {
"website": "github.com/-",
"source": "Explicit",
"type": "String Extractor",
"version": "1.0"
}
}
],
"rule_matcher": [
{
"results": [
{
"metadata": {
"campaign_title": "-",
"project_title": "-",
"maxdepth": 0
},
"sub_organization_id": null,
"description": null,
"project_version_id": "-",
"rule_id": "2569463",
"rule_tag": "-",
"rule_type": "youtube_keyword",
"project_id": "-",
"appid": "nats-main",
"organization_id": "-",
"value": "طالبان شلیک",
"campaign_id": "-",
"node_id": null
}
],
"attribs": {
"website": "github.com/-",
"source": "Explicit",
"type": "youtube",
"version": "1.0"
}
}
]
}
}
Without knowing the structure of the JSON inside the .gz file, it's tough to say exactly how to help.
This is what I use to load a .gz file from S3 directly into a DataFrame:
import gzip
import json
import boto3
import pandas as pd
# bucket and key are the S3 bucket name and object key
s3sr = boto3.resource('s3')
obj = s3sr.Object(bucket, key)
data = json.loads(gzip.decompress(obj.get()['Body'].read()))
df = pd.DataFrame(data)
And if opening a .gz file from local disk, I use this:
with gzip.open(fullpath, 'rb') as f:
    data = json.loads(f.read().decode('utf-8'))
df = pd.DataFrame(data)
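One caveat for .jsonl.gz files: a JSON Lines file holds one JSON document per line, so a single json.loads call on the whole payload typically fails once there is more than one record. A minimal sketch of per-line parsing follows; parse_jsonl_gz is a hypothetical helper, and the in-memory payload stands in for the bytes returned by obj.get()['Body'].read().

```python
import gzip
import json

def parse_jsonl_gz(raw_bytes):
    """Decompress gzip bytes and parse one JSON object per non-empty line."""
    text = gzip.decompress(raw_bytes).decode("utf-8")
    return [json.loads(line) for line in text.split("\n") if line.strip()]

# Stand-in for the S3 object body: two records, gzip-compressed in memory.
records_in = [
    {"uid": "a", "type": "youtube_comment"},
    {"uid": "b", "type": "youtube_comment"},
]
payload = gzip.compress(
    "\n".join(json.dumps(r) for r in records_in).encode("utf-8")
)

records = parse_jsonl_gz(payload)
print(len(records), "records parsed")
```

The resulting list of dicts can then be handed to pd.DataFrame(records) if a DataFrame is needed.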
This is what finally ended up working to copy the file, but I am losing the individual objects within the file and they are showing up on just one line:
fileun = dbutils.widgets.get("fileun")
directory = dbutils.widgets.get("directory")
key=directory+fileun
import pandas as pd
import numpy as np
import boto3
import pyodbc
import os
import gzip
import shutil
import json
s3_resource = boto3.resource('s3',
    aws_access_key_id=[key_id],
    aws_secret_access_key=[access_key])
my_bucket = s3_resource.Bucket(bucket_name)
objects = my_bucket.objects.filter(Prefix=directory+fileun)
file = fileun[:-3]  # strip the .gz suffix, as in the question
for obj in objects:
    path, filename = os.path.split(obj.key)
    my_bucket.download_file(obj.key, filename)
    with gzip.open(filename, 'rb') as f_in:
        with open(file, 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)
with open(file) as f:
    data = f.readlines()
df = pd.DataFrame(data)
jinsert = df.to_json(orient="records")
writefile = open('[container_name]' % (file), 'w')
writefile.write(jinsert)
writefile.close()
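The objects collapse onto one line because df.to_json(orient="records") serializes the whole frame as a single JSON array. If the goal is only to unzip the file, the decompressed bytes can be copied as they are, which keeps one object per line. Here is a small sketch under that assumption; gunzip_file is a hypothetical helper, and the temporary directory stands in for the mounted container path.

```python
import gzip
import os
import shutil
import tempfile

def gunzip_file(src_path, dest_path):
    """Stream-decompress src_path to dest_path without parsing the contents."""
    with gzip.open(src_path, "rb") as f_in, open(dest_path, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)

# Demo with a temporary directory standing in for the mounted container.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "sample.jsonl.gz")
dest = os.path.join(workdir, "sample.jsonl")

lines = ['{"uid": "a"}', '{"uid": "b"}']
with gzip.open(src, "wt", encoding="utf-8") as f:
    f.write("\n".join(lines) + "\n")

gunzip_file(src, dest)

with open(dest, encoding="utf-8") as f:
    out_lines = f.read().splitlines()
print(out_lines)
```

Since nothing is parsed, the output file has exactly the line structure of the original .jsonl.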

How to get a value from JSON

This is the first time I'm working with JSON, and I'm trying to pull url out of the JSON below.
{
"name": "The_New11d112a_Company_Name",
"sections": [
{
"name": "Products",
"payload": [
{
"id": 1,
"name": "TERi Geriatric Patient Skills Trainer,
"type": "string"
}
]
},
{
"name": "Contact Info",
"payload": [
{
"id": 1,
"name": "contacts",
"url": "https://www.a3bs.com/catheterization-kits-8000892-3011958-3b-scientific,p_1057_31043.html",
"contacts": [
{
"name": "User",
"email": "Company Email",
"phone": "Company PhoneNumber"
}
],
"type": "contact"
}
]
}
],
"tags": [
"Male",
"Airway"
],
"_id": "0e4cd5c6-4d2f-48b9-acf2-5aa75ade36e1"
}
I have been able to access description and _id via
data = json.loads(line)
if 'xpath' in data:
    xpath = data["_id"]
    description = data["sections"][0]["payload"][0]["description"]
However, I can't seem to figure out a way to access url. Another issue is that there could be other items in sections, which makes indexing into Contact Info by position a non-starter.
Hope this helps:
import json
with open("test.json", "r") as f:
    json_out = json.load(f)
for i in json_out["sections"]:
    for j in i["payload"]:
        for key in j:
            if "url" in key:
                print(key, '->', j[key])
I think your JSON is damaged; it should look like this:
{
"name": "The_New11d112a_Company_Name",
"sections": [
{
"name": "Products",
"payload": [
{
"id": 1,
"name": "TERi Geriatric Patient Skills Trainer",
"type": "string"
}
]
},
{
"name": "Contact Info",
"payload": [
{
"id": 1,
"name": "contacts",
"url": "https://www.a3bs.com/catheterization-kits-8000892-3011958-3b-scientific,p_1057_31043.html",
"contacts": [
{
"name": "User",
"email": "Company Email",
"phone": "Company PhoneNumber"
}
],
"type": "contact"
}
]
}
],
"tags": [
"Male",
"Airway"
],
"_id": "0e4cd5c6-4d2f-48b9-acf2-5aa75ade36e1"
}
You can check it at http://json.parser.online.fr/.
And if you want to get the value of url:
import json
j = json.load(open('yourJSONfile.json'))
print(j['sections'][1]['payload'][0]['url'])
I think it's worth writing a short function that collects the URL(s). You can then decide whether to use the first URL in the returned list, or skip processing if no URL is available in your data.
The function looks like this:
def extract_urls(data):
    payloads = []
    for section in data['sections']:
        payloads += section.get('payload') or []
    urls = [x['url'] for x in payloads if 'url' in x]
    return urls
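As a quick usage sketch, the function can be exercised on a cut-down document. The sample dict and the example.com URL are made up for illustration, and data.get('sections', []) is a small defensive tweak that is not in the original function.

```python
def extract_urls(data):
    """Collect every 'url' value found in any section's payload list."""
    payloads = []
    for section in data.get("sections", []):
        # A section may have no payload at all, or an explicit null.
        payloads += section.get("payload") or []
    return [item["url"] for item in payloads if "url" in item]

# Made-up document with the same shape as the JSON in the question.
sample = {
    "name": "Example",
    "sections": [
        {"name": "Products", "payload": [{"id": 1, "name": "Trainer"}]},
        {"name": "Contact Info",
         "payload": [{"id": 1, "url": "https://example.com/item.html"}]},
        {"name": "Empty section"},
    ],
}

urls = extract_urls(sample)
# Use the first URL if there is one, otherwise skip processing.
first_url = urls[0] if urls else None
print(first_url)
```

Because the function scans every section, it does not matter at which index Contact Info appears.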
This should print out the URL:
import json
# open the JSON file for reading
with open('test.json','r') as f:
    # parse the file contents (JSON text) into Python objects
    data = json.loads(f.read())
# after inspecting the structure of the JSON data, the location of the
# 'url' key is known, so index into data to extract its value
print(data['sections'][1]['payload'][0]['url'])
The exact location of the 'url' key:
Index 1 of the array stored under the key 'sections' holds a dict
Inside that dict, the key 'payload' contains an array
Index 0 of that array is a dict with the key 'url'
While testing my solution, I noticed that the JSON provided is flawed. After fixing the three flaws, I ended up with this:
{
"name": "The_New11d112a_Company_Name",
"sections": [
{
"name": "Products",
"payload": [
{
"id": 1,
"name": "TERi Geriatric Patient Skills Trainer",
"type": "string"
}
]
},
{
"name": "Contact Info",
"payload": [
{
"id": 1,
"name": "contacts",
"url": "https://www.a3bs.com/catheterization-kits-8000892-3011958-3b-scientific,p_1057_31043.html",
"contacts": [
{
"name": "User",
"email": "Company Email",
"phone": "Company PhoneNumber"
}
],
"type": "contact"
}
]
}
],
"tags": [
"Male",
"Airway"
],
"_id": "0e4cd5c6-4d2f-48b9-acf2-5aa75ade36e1"}
Using the JSON provided by Vincent55, I wrote working code with exception handling, under certain assumptions.
Working code:
## Assuming that the target data is always under sections[i].payload
from json import loads
line = open("data.json").read()
data = loads(line)["sections"]
for x in data:
    try:
        # Assuming that there is only one payload
        if x["payload"][0]["url"]:
            print(x["payload"][0]["url"])
    except (KeyError, IndexError):
        # no payload, empty payload, or no url in this section
        pass

Retrieve data from json file using python

I'm new to Python. I'm running Python on Azure Databricks. I have a .json file; the important fields are shown here:
{
"school": [
{
"schoolid": "mr1",
"board": "cbse",
"principal": "akseal",
"schoolName": "dps",
"schoolCategory": "UNKNOWN",
"schoolType": "UNKNOWN",
"city": "mumbai",
"sixhour": true,
"weighting": 3,
"paymentMethods": [
"cash",
"cheque"
],
"contactDetails": [
{
"name": "picsa",
"type": "studentactivities",
"information": [
{
"type": "PHONE",
"detail": "+917597980"
}
]
}
],
"addressLocations": [
{
"locationType": "School",
"address": {
"countryCode": "IN",
"city": "Mumbai",
"zipCode": "400061",
"street": "Madh",
"buildingNumber": "80"
},
"Location": {
"latitude": 49.313885,
"longitude": 72.877426
},
I need to create a data frame with schoolName as one column and latitude and longitude as the other two columns. Can you please suggest how to do that?
You can use the json.load() function. Here's an example:
import json
with open('path_to_file/file.json') as f:
    data = json.load(f)
print(data)
Use this:
import json # built-in
with open("filename.json", 'r') as jsonFile:
    Data = json.load(jsonFile)
Data is now a dictionary of the file's contents, e.g.:
for i in Data:
    # loops through keys
    print(Data[i]) # prints the value
For more on JSON:
https://docs.python.org/3/library/json.html
and python dictionaries:
https://www.programiz.com/python-programming/dictionary
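To get from the loaded dictionary to the three columns asked for, the nested lists have to be walked explicitly. The following is a sketch only: the JSON in the question is truncated, so the shape below (a "Location" dict inside each "addressLocations" entry) is an assumption based on the visible part, and school_rows is a hypothetical helper.

```python
import json

# Cut-down copy of the visible part of the question's JSON, closed off by hand.
raw = """
{
  "school": [
    {
      "schoolid": "mr1",
      "schoolName": "dps",
      "addressLocations": [
        {
          "locationType": "School",
          "address": {"countryCode": "IN", "city": "Mumbai"},
          "Location": {"latitude": 49.313885, "longitude": 72.877426}
        }
      ]
    }
  ]
}
"""

def school_rows(data):
    """Return one row per (school, address location) pair."""
    rows = []
    for school in data.get("school", []):
        for loc in school.get("addressLocations", []):
            point = loc.get("Location", {})
            rows.append({
                "schoolName": school.get("schoolName"),
                "latitude": point.get("latitude"),
                "longitude": point.get("longitude"),
            })
    return rows

rows = school_rows(json.loads(raw))
print(rows)
# With pandas installed, pandas.DataFrame(rows) turns these rows into a data frame.
```

A school with several address locations produces several rows, one per location.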
