I am trying to read the "transcript" field from my JSON file, but I just can't get to it. I can access the main JSON object, but not that specific part of it.
My JSON file:
"results": [ {
"alternatives": [ {
"confidence": 0.90822744,
"transcript":"bunch of words" ,
"words": [ {
"endTime": "1.400s",
"startTime": "0s",
"word": "bunch"
},...
My code snippet:
import json

f = open('C:\\Users\\Kerem\\Desktop\\YeniGoogleapi\\MugeAnliileTatliSert7Kasim.json', encoding="utf8")
data = json.load(f)
for i in data['results']:
    print(i)
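Each element of data['results'] is itself a dictionary, so you have to step through the nested structure to reach the transcript. A minimal sketch, assuming each result has an 'alternatives' list whose items carry a 'transcript' key, as in the snippet above:
for result in data['results']:
    for alternative in result['alternatives']:
        # 'transcript' sits inside each alternative, per the JSON shown above
        print(alternative['transcript'])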
I'm new to Python. I'm running Python on Azure Databricks. I have a .json file. I'm putting the important fields of the JSON file here:
{
"school": [
{
"schoolid": "mr1",
"board": "cbse",
"principal": "akseal",
"schoolName": "dps",
"schoolCategory": "UNKNOWN",
"schoolType": "UNKNOWN",
"city": "mumbai",
"sixhour": true,
"weighting": 3,
"paymentMethods": [
"cash",
"cheque"
],
"contactDetails": [
{
"name": "picsa",
"type": "studentactivities",
"information": [
{
"type": "PHONE",
"detail": "+917597980"
}
]
}
],
"addressLocations": [
{
"locationType": "School",
"address": {
"countryCode": "IN",
"city": "Mumbai",
"zipCode": "400061",
"street": "Madh",
"buildingNumber": "80"
},
"Location": {
"latitude": 49.313885,
"longitude": 72.877426
},
I need to create a data frame with schoolName as one column and latitude and longitude as two other columns. Can you please suggest how to do that?
You can use the method json.load(); here's an example:
import json

with open('path_to_file/file.json') as f:
    data = json.load(f)

print(data)
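Once data is a plain dictionary, you can pull out the columns you asked for. A minimal sketch, assuming the structure shown in the question (a top-level 'school' list whose entries have an 'addressLocations' list with a nested 'Location' object) and that pandas is available:
import json
import pandas as pd

with open('path_to_file/file.json') as f:
    data = json.load(f)

rows = []
for school in data['school']:
    for loc in school.get('addressLocations', []):
        rows.append({
            'schoolName': school['schoolName'],
            'latitude': loc['Location']['latitude'],
            'longitude': loc['Location']['longitude'],
        })

df = pd.DataFrame(rows)
print(df)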
Or use this:
import json  # built-in

with open("filename.json", 'r') as jsonFile:
    Data = json.load(jsonFile)
Data is now a dictionary of the contents, e.g.
for i in Data:
    # loops through the keys
    print(Data[i])  # prints the value
For more on JSON:
https://docs.python.org/3/library/json.html
and Python dictionaries:
https://www.programiz.com/python-programming/dictionary
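Since the question mentions Azure Databricks, a Spark-native sketch is another option. This assumes the file has the structure shown in the question and sits at a hypothetical DBFS path; adjust the path and field names to your real file:
from pyspark.sql.functions import col, explode

# multiLine is needed because the file is a single multi-line JSON document
raw = spark.read.option("multiLine", "true").json("dbfs:/path/to/school.json")

result = (
    raw
    .select(explode("school").alias("s"))
    .select(col("s.schoolName").alias("schoolName"),
            explode("s.addressLocations").alias("loc"))
    .select("schoolName",
            col("loc.Location.latitude").alias("latitude"),
            col("loc.Location.longitude").alias("longitude"))
)
result.show()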
I am new to PySpark. I have a requirement where I need to convert a big CSV file at an HDFS location into multiple nested JSON files based on the distinct PrimaryId.
Sample Input: data.csv
PrimaryId,FirstName,LastName,City,CarName,DogName
100,John,Smith,NewYork,Toyota,Spike
100,John,Smith,NewYork,BMW,Spike
100,John,Smith,NewYork,Toyota,Rusty
100,John,Smith,NewYork,BMW,Rusty
101,Ben,Swan,Sydney,Volkswagen,Buddy
101,Ben,Swan,Sydney,Ford,Buddy
101,Ben,Swan,Sydney,Audi,Buddy
101,Ben,Swan,Sydney,Volkswagen,Max
101,Ben,Swan,Sydney,Ford,Max
101,Ben,Swan,Sydney,Audi,Max
102,Julia,Brown,London,Mini,Lucy
Sample Output Files:
File1: Output_100.json
{
"100": [
{
"City": "NewYork",
"FirstName": "John",
"LastName": "Smith",
"CarName": [
"Toyota",
"BMW"
],
"DogName": [
"Spike",
"Rusty"
]
}
]
}
File2: Output_101.json
{
"101": [
{
"City": "Sydney",
"FirstName": "Ben",
"LastName": "Swan",
"CarName": [
"Volkswagen",
"Ford",
"Audi"
],
"DogName": [
"Buddy",
"Max"
]
}
]
}
File3: Output_102.json
{
"102": [
{
"City": "London",
"FirstName": "Julia",
"LastName": "Brown",
"CarName": [
"Mini"
],
"DogName": [
"Lucy"
]
}
]
}
Any quick help will be appreciated.
It seems you need to perform a group by on PrimaryId and collect CarName and DogName as sets.
from pyspark.sql.functions import collect_set
df = spark.read.format("csv").option("header", "true").load("cars.csv")
df2 = (
    df
    .groupBy("PrimaryId", "FirstName", "LastName")
    .agg(collect_set('CarName').alias('CarName'),
         collect_set('DogName').alias('DogName'))
)
df2.write.format("json").save("cars.json", mode="overwrite")
Generated files:
{"PrimaryId":"100","FirstName":"John","LastName":"Smith","CarName":["Toyota","BMW"],"DogName":["Spike","Rusty"]}
{"PrimaryId":"101","FirstName":"Ben","LastName":"Swan","CarName":["Ford","Volkswagen","Audi"],"DogName":["Max","Buddy"]}
{"PrimaryId":"102","FirstName":"Julia","LastName":"Brown","CarName":["Mini"],"DogName":["Lucy"]}
Let me know if this is what you are looking for.
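If you need the output split by PrimaryId rather than written as a single set of part files, one option, as a sketch, is to partition the write by that column. Note that Spark picks the part-file names itself, so producing files named exactly Output_<id>.json with the id as the top-level key, as in the samples above, would still need a post-processing step:
# Writes one sub-directory per id (PrimaryId=100/, PrimaryId=101/, ...) under the target path.
# The PrimaryId column ends up in the directory name rather than inside the JSON records.
(
    df2
    .write
    .partitionBy("PrimaryId")
    .format("json")
    .save("cars_by_id", mode="overwrite")
)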
You can use pandas' DataFrame.groupby() to group by the PrimaryId and then iterate over the resulting DataFrameGroupBy object, building one dictionary per group and writing it to a file.
You need to install pandas into your virtualenv with $ pip install pandas.
# coding: utf-8
import json

import pandas as pd


def group_csv_columns(csv_file):
    df = pd.read_csv(csv_file)
    group_frame = df.groupby('PrimaryId')
    for key, data_frame in group_frame:
        data = {
            str(key): [{
                "City": data_frame['City'].unique().tolist()[0],
                "FirstName": data_frame['FirstName'].unique().tolist()[0],
                "CarName": data_frame['CarName'].unique().tolist(),
                "DogName": data_frame['DogName'].unique().tolist(),
                "LastName": data_frame['LastName'].unique().tolist()[0],
            }]
        }
        # Write each group to its own file, e.g. Output_100.json
        file_name = 'Output_' + str(key) + '.json'
        with open(file_name, 'w') as fh:
            fh.write(json.dumps(data))


group_csv_columns('/tmp/sample.csv')
Call group_csv_columns() with the name of the file that holds the CSV contents.
Check the pandas docs.
It would be helpful if the values were converted into rows of a CSV and the keys became the columns of the CSV.
{
  "_id": {
    "$uId": "12345678"
  },
  "comopany_productId": "J00354",
  "company_product name": "BIKE 12345",
  "search_results": [
    {
      "product_id": "44zIVQ",
      "constituents": [
        {
          "tyre": "2",
          "name": "dunlop"
        },
        {
          "strength": "able to move 100 km",
          "name": "MRF"
        }
      ],
      "name": "Yhakohuka",
      "form": "tyre",
      "schedule": {
        "category": "a",
        "label": "It needs a good car to fit in"
      },
      "standardUnits": 20,
      "price": 2000,
      "search_score": 0.947474,
      "Form": "tyre",
      "manufacturer": "hum",
      "id": "12345678",
      "size": "4"
    },
I want uId, company_productId, company_product name, and the various keys inside search_results (tyre, name, strength, form, schedule, category, label, standardUnits, price, search_score, Form, manufacturer, id, size) as different columns in Excel, and the values as rows.
In Python you can use the pandas and json libraries to convert it to a CSV like this:
import json
import pandas as pd

# json_normalize is exposed as pandas.json_normalize in pandas >= 1.0
# (the older import was `from pandas.io.json import json_normalize`)
pd.json_normalize(json.loads('your_json_string')).to_csv('file_name.csv')
If you have your JSON saved in a file, use json.load instead, passing the file object to it.
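Since the question asks for one column per key inside search_results, a sketch using json_normalize's record_path and meta arguments may be closer to what is wanted. The field names below are taken from the sample above and the file name is hypothetical; adjust both to your real data:
import json
import pandas as pd

with open('your_file.json') as f:   # hypothetical file name
    data = json.load(f)

df = pd.json_normalize(
    data,
    record_path='search_results',          # one row per entry in search_results
    meta=[['_id', '$uId'],                 # repeated onto every row
          'comopany_productId',
          'company_product name'],
)
df.to_csv('file_name.csv', index=False)
The nested constituents list stays as a Python list in its column; flattening that level as well would take a second json_normalize (or an explode) over that column.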
Here is the sample JSON which I am trying to parse in Python.
I am having a hard time parsing through "files".
Any help appreciated.
{
"startDate": "2016-02-19T08:19:30.764-0700",
"endDate": "2016-02-19T08:20:19.058-07:00",
"files": [
{
"createdOn": "2017-02-19T08:20:19.391-0700",
"contentType": "text/plain",
"type": "jpeg",
"id": "Photoshop",
"fileUri": "output.log"
}
],
"status": "OK",
"linkToken": "-3418849688029673541",
"subscriberViewers": [
"null"
]
}
To print the id of each file in the array:
import json

data = json.loads(rawdata)  # rawdata holds the JSON text above as a string
files = data['files']
for f in files:
    print(f['id'])
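If the JSON lives in a file rather than in a string, a small variation (with a hypothetical file name) does the same thing:
import json

# Hypothetical file name; substitute the real path to the JSON shown above.
with open('sample.json') as fh:
    data = json.load(fh)

for f in data['files']:
    print(f['id'], f['fileUri'])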