How to convert JSON file data into a CSV file in Python - python

I have a .json file which has this type of data, the names of universities of the world:
[
  {
    "web_pages": [
      "https://www.cstj.qc.ca",
      "https://ccmt.cstj.qc.ca",
      "https://ccml.cstj.qc.ca"
    ],
    "name": "Cégep de Saint-Jérôme",
    "alpha_two_code": "CA",
    "state-province": null,
    "domains": [
      "cstj.qc.ca"
    ],
    "country": "Canada"
  },
  {
    "web_pages": [
      "http://www.lindenwood.edu/"
    ],
    "name": "Lindenwood University",
    "alpha_two_code": "US",
    "state-province": null,
    "domains": [
      "lindenwood.edu"
    ],
    "country": "United States"
  },
  {
    "web_pages": [
      ...
(the file continues)
I want to convert this .json file into CSV using Python. What is the best way to produce the CSV file?

This solution uses pandas. (Note: json_normalize moved to the top-level pandas namespace; the old pandas.io.json import is deprecated.)
import json
import pandas as pd

with open('infile.json') as json_data:
    d = json.load(json_data)

df = pd.json_normalize(d)
df.to_csv('outfile.csv', index=False)
Also, as @LucaBezerra has mentioned in the comments, the current text has an encoding problem which you might want to fix (look at the first "name").
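If pandas is not available, the standard library's csv module can do the same job by hand. A minimal sketch, assuming the list-of-dicts structure shown in the question (the sample record below is inlined from it; list-valued fields such as web_pages are joined with commas so each record stays on one CSV row):

```python
import csv

# Sample records shaped like the question's university data
records = [
    {"web_pages": ["https://www.cstj.qc.ca", "https://ccmt.cstj.qc.ca"],
     "name": "Cégep de Saint-Jérôme",
     "alpha_two_code": "CA",
     "state-province": None,
     "domains": ["cstj.qc.ca"],
     "country": "Canada"},
]

with open('outfile.csv', 'w', newline='', encoding='utf-8') as out:
    writer = csv.DictWriter(out, fieldnames=records[0].keys())
    writer.writeheader()
    for rec in records:
        # Join list-valued fields so each one becomes a single CSV cell
        row = {k: ', '.join(v) if isinstance(v, list) else v
               for k, v in rec.items()}
        writer.writerow(row)
```

In a real run you would populate records with json.load(open('infile.json', encoding='utf-8')) instead of the inline sample.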

Related

How to read an array inside an array in JSON using python

I am trying to read the "transcript" from my JSON file and I just can't get inside it. I can access the main JSON but not the specific part of it.
My JSON file
"results": [ {
  "alternatives": [ {
    "confidence": 0.90822744,
    "transcript": "bunch of words",
    "words": [ {
      "endTime": "1.400s",
      "startTime": "0s",
      "word": "bunch"
    }, ...
My Code snippet
import json

f = open('C:\\Users\\Kerem\\Desktop\\YeniGoogleapi\\MugeAnliileTatliSert7Kasim.json', encoding="utf8")
data = json.load(f)
for i in data['results']:
    print(i)
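The "transcript" values sit two array levels down (results → alternatives), so each level needs its own loop. A minimal sketch, using an inline sample shaped like the question's file instead of the path on disk:

```python
import json

# Inline sample mirroring the question's structure
raw = '''{"results": [{"alternatives": [{"confidence": 0.90822744,
    "transcript": "bunch of words",
    "words": [{"endTime": "1.400s", "startTime": "0s", "word": "bunch"}]}]}]}'''

data = json.loads(raw)
for result in data['results']:          # outer array
    for alt in result['alternatives']:  # inner array
        print(alt['transcript'])        # -> bunch of words
```

With the file on disk, replace json.loads(raw) with json.load(f) on the already-opened file object.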

Retrieve data from json file using python

I'm new to Python. I'm running Python on Azure Databricks. I have a .json file; I'm putting the important fields of it here:
{
  "school": [
    {
      "schoolid": "mr1",
      "board": "cbse",
      "principal": "akseal",
      "schoolName": "dps",
      "schoolCategory": "UNKNOWN",
      "schoolType": "UNKNOWN",
      "city": "mumbai",
      "sixhour": true,
      "weighting": 3,
      "paymentMethods": [
        "cash",
        "cheque"
      ],
      "contactDetails": [
        {
          "name": "picsa",
          "type": "studentactivities",
          "information": [
            {
              "type": "PHONE",
              "detail": "+917597980"
            }
          ]
        }
      ],
      "addressLocations": [
        {
          "locationType": "School",
          "address": {
            "countryCode": "IN",
            "city": "Mumbai",
            "zipCode": "400061",
            "street": "Madh",
            "buildingNumber": "80"
          },
          "Location": {
            "latitude": 49.313885,
            "longitude": 72.877426
          },
I need to create a data frame with schoolName as one column and latitude and longitude as two other columns. Can you please suggest how to do that?
You can use the method json.load(); here's an example:
import json

with open('path_to_file/file.json') as f:
    data = json.load(f)
print(data)
Use this:
import json  # built-in

with open("filename.json", 'r') as jsonFile:
    data = json.load(jsonFile)
data is now a dictionary of the contents, e.g.:
for key in data:
    # loops through the keys
    print(data[key])  # prints the value
For more on JSON:
https://docs.python.org/3/library/json.html
and Python dictionaries:
https://www.programiz.com/python-programming/dictionary#:~:text=Python%20dictionary%20is%20an%20unordered,when%20the%20key%20is%20known.
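For the specific frame the question asks for (schoolName, latitude, longitude), pandas' json_normalize can flatten the nested addressLocations list. A sketch assuming pandas is installed and the structure shown in the question (the inline sample below is trimmed to the relevant fields):

```python
import pandas as pd

# Inline sample shaped like the question's "school" list (trimmed)
data = {"school": [{
    "schoolName": "dps",
    "addressLocations": [{
        "locationType": "School",
        "Location": {"latitude": 49.313885, "longitude": 72.877426},
    }],
}]}

df = pd.json_normalize(
    data['school'],
    record_path='addressLocations',  # one row per address location
    meta=['schoolName'],             # carry the school name onto each row
)
# Nested dicts inside each record are flattened with dotted column names
result = df[['schoolName', 'Location.latitude', 'Location.longitude']]
print(result)
```

On Databricks you would load data with json.load from the file instead of the inline dict, and could then convert with spark.createDataFrame(result) if a Spark DataFrame is needed.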

Pyspark : Convert a CSV to Nested JSON

I am new to pyspark. I have a requirement where I need to convert a big CSV file at an HDFS location into multiple nested JSON files based on the distinct PrimaryId.
Sample Input: data.csv
PrimaryId,FirstName,LastName,City,CarName,DogName
100,John,Smith,NewYork,Toyota,Spike
100,John,Smith,NewYork,BMW,Spike
100,John,Smith,NewYork,Toyota,Rusty
100,John,Smith,NewYork,BMW,Rusty
101,Ben,Swan,Sydney,Volkswagen,Buddy
101,Ben,Swan,Sydney,Ford,Buddy
101,Ben,Swan,Sydney,Audi,Buddy
101,Ben,Swan,Sydney,Volkswagen,Max
101,Ben,Swan,Sydney,Ford,Max
101,Ben,Swan,Sydney,Audi,Max
102,Julia,Brown,London,Mini,Lucy
Sample Output Files:
File1: Output_100.json
{
  "100": [
    {
      "City": "NewYork",
      "FirstName": "John",
      "LastName": "Smith",
      "CarName": [
        "Toyota",
        "BMW"
      ],
      "DogName": [
        "Spike",
        "Rusty"
      ]
    }
  ]
}
File2: Output_101.json
{
  "101": [
    {
      "City": "Sydney",
      "FirstName": "Ben",
      "LastName": "Swan",
      "CarName": [
        "Volkswagen",
        "Ford",
        "Audi"
      ],
      "DogName": [
        "Buddy",
        "Max"
      ]
    }
  ]
}
File3: Output_102.json
{
  "102": [
    {
      "City": "London",
      "FirstName": "Julia",
      "LastName": "Brown",
      "CarName": [
        "Mini"
      ],
      "DogName": [
        "Lucy"
      ]
    }
  ]
}
Any quick help will be appreciated.
It seems you need to perform a group by on the Id and collect Cars and Dogs as a set.
from pyspark.sql.functions import collect_set

df = spark.read.format("csv").option("header", "true").load("cars.csv")
df2 = (
    df.groupBy("PrimaryId", "FirstName", "LastName")
      .agg(collect_set('CarName').alias('CarName'),
           collect_set('DogName').alias('DogName'))
)
df2.write.format("json").save("cars.json", mode="overwrite")
Generated files:
{"PrimaryId":"100","FirstName":"John","LastName":"Smith","CarName":["Toyota","BMW"],"DogName":["Spike","Rusty"]}
{"PrimaryId":"101","FirstName":"Ben","LastName":"Swan","CarName":["Ford","Volkswagen","Audi"],"DogName":["Max","Buddy"]}
{"PrimaryId":"102","FirstName":"Julia","LastName":"Brown","CarName":["Mini"],"DogName":["Lucy"]}
Let me know if this is what you are looking for.
You can use pandas.DataFrame.groupby() to group by the Id and then iterate over the DataFrameGroupBy object, building the objects and writing the files.
You need to install pandas (pip install pandas) in your virtualenv.
# coding: utf-8
import json
import pandas as pd

def group_csv_columns(csv_file):
    df = pd.read_csv(csv_file)
    group_frame = df.groupby('PrimaryId')
    for key, data_frame in group_frame:
        data = {
            key: [{
                "City": data_frame['City'].unique().tolist()[0],
                "FirstName": data_frame['FirstName'].unique().tolist()[0],
                "CarName": data_frame['CarName'].unique().tolist(),
                "DogName": data_frame['DogName'].unique().tolist(),
                "LastName": data_frame['LastName'].unique().tolist()[0],
            }]
        }
        # Write each group to its own file
        file_name = 'Output_' + str(key) + '.json'
        with open(file_name, 'w') as fh:
            fh.write(json.dumps(data))

group_csv_columns('/tmp/sample.csv')
Call group_csv_columns() with the name of the file containing the CSV contents.
Check the pandas docs.

How to convert complex JSON to CSV using Python or R

It would be helpful if the values were converted into CSV rows and the keys became CSV columns.
{
  "_id": {
    "$uId": "12345678"
  },
  "comopany_productId": "J00354",
  "company_product name": "BIKE 12345",
  "search_results": [
    {
      "product_id": "44zIVQ",
      "constituents": [
        {
          "tyre": "2",
          "name": "dunlop"
        },
        {
          "strength": "able to move 100 km",
          "name": "MRF"
        }
      ],
      "name": "Yhakohuka",
      "form": "tyre",
      "schedule": {
        "category": "a",
        "label": "It needs a good car to fit in"
      },
      "standardUnits": 20,
      "price": 2000,
      "search_score": 0.947474,
      "Form": "tyre",
      "manufacturer": "hum",
      "id": "12345678",
      "size": "4"
    },
I want uId, comopany_productId, company_product name, and the various keys inside search_results (tyre, name, strength, form, schedule, category, label, standardUnits, price, search_score, Form, manufacturer, id, size) as separate columns, with the values as rows.
In Python you can use the pandas and json libraries to convert it to a CSV like this (json_normalize is available as pd.json_normalize; the old pandas.io.json import is deprecated):
import json
import pandas as pd

pd.json_normalize(json.loads('your_json_string')).to_csv('file_name.csv')
If you have your json saved on a file, use json.load instead, passing the file object to it.
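Top-level normalization alone will leave search_results as a single list-valued column. To get one CSV row per search result, with the top-level product fields repeated on each row, record_path and meta can be used. A sketch assuming pandas is installed; the inline sample below is a cleaned-up, straight-quoted subset of the question's snippet (its curly quotes would make json.loads fail):

```python
import json
import pandas as pd

# Cleaned-up subset of the question's JSON (straight quotes, valid syntax)
raw = '''{
  "_id": {"$uId": "12345678"},
  "comopany_productId": "J00354",
  "company_product name": "BIKE 12345",
  "search_results": [
    {"product_id": "44zIVQ", "name": "Yhakohuka", "form": "tyre",
     "price": 2000, "manufacturer": "hum"}
  ]
}'''

data = json.loads(raw)
df = pd.json_normalize(
    data,
    record_path='search_results',  # one row per search result
    meta=['comopany_productId', 'company_product name'],
)
df.to_csv('results.csv', index=False)
```

Each element of search_results becomes a row; nested dicts inside it (e.g. schedule in the full file) would be flattened into dotted columns such as schedule.category.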

How do I parse multiple levels(arrays) within JSON file using python?

Here is the sample JSON which I am trying to parse in Python.
I am having a hard time parsing through "files".
Any help appreciated.
{
  "startDate": "2016-02-19T08:19:30.764-0700",
  "endDate": "2016-02-19T08:20:19.058-07:00",
  "files": [
    {
      "createdOn": "2017-02-19T08:20:19.391-0700",
      "contentType": "text/plain",
      "type": "jpeg",
      "id": "Photoshop",
      "fileUri": "output.log"
    }
  ],
  "status": "OK",
  "linkToken": "-3418849688029673541",
  "subscriberViewers": [
    "null"
  ]
}
To print the id of each file in the array:
import json

data = json.loads(rawdata)
files = data['files']
for f in files:
    print(f['id'])
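Putting it together with the sample inlined (so it runs standalone), the same pattern reaches any field inside the files array; each element is a plain dict, so normal key access works at every level:

```python
import json

# Inline subset of the question's JSON, standing in for rawdata
rawdata = '''{
  "startDate": "2016-02-19T08:19:30.764-0700",
  "files": [
    {"createdOn": "2017-02-19T08:20:19.391-0700",
     "contentType": "text/plain", "type": "jpeg",
     "id": "Photoshop", "fileUri": "output.log"}
  ],
  "status": "OK",
  "subscriberViewers": ["null"]
}'''

data = json.loads(rawdata)
for f in data['files']:
    # Index into each dict in the array for whichever fields you need
    print(f['id'], f['fileUri'])  # -> Photoshop output.log
```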
