Pyspark : Convert a CSV to Nested JSON - python

I am new to PySpark. I have a requirement where I need to convert a big CSV file at an HDFS location into multiple nested JSON files based on distinct PrimaryId.
Sample Input: data.csv
PrimaryId,FirstName,LastName,City,CarName,DogName
100,John,Smith,NewYork,Toyota,Spike
100,John,Smith,NewYork,BMW,Spike
100,John,Smith,NewYork,Toyota,Rusty
100,John,Smith,NewYork,BMW,Rusty
101,Ben,Swan,Sydney,Volkswagen,Buddy
101,Ben,Swan,Sydney,Ford,Buddy
101,Ben,Swan,Sydney,Audi,Buddy
101,Ben,Swan,Sydney,Volkswagen,Max
101,Ben,Swan,Sydney,Ford,Max
101,Ben,Swan,Sydney,Audi,Max
102,Julia,Brown,London,Mini,Lucy
Sample Output Files:
File1: Output_100.json
{
  "100": [
    {
      "City": "NewYork",
      "FirstName": "John",
      "LastName": "Smith",
      "CarName": [
        "Toyota",
        "BMW"
      ],
      "DogName": [
        "Spike",
        "Rusty"
      ]
    }
  ]
}
File2: Output_101.json
{
  "101": [
    {
      "City": "Sydney",
      "FirstName": "Ben",
      "LastName": "Swan",
      "CarName": [
        "Volkswagen",
        "Ford",
        "Audi"
      ],
      "DogName": [
        "Buddy",
        "Max"
      ]
    }
  ]
}
File3: Output_102.json
{
  "102": [
    {
      "City": "London",
      "FirstName": "Julia",
      "LastName": "Brown",
      "CarName": [
        "Mini"
      ],
      "DogName": [
        "Lucy"
      ]
    }
  ]
}
Any quick help would be appreciated.

It seems you need to group by PrimaryId and collect CarName and DogName as sets.
from pyspark.sql.functions import collect_set

df = spark.read.format("csv").option("header", "true").load("cars.csv")

df2 = (
    df
    .groupBy("PrimaryId", "FirstName", "LastName")
    .agg(collect_set("CarName").alias("CarName"),
         collect_set("DogName").alias("DogName"))
)

df2.write.format("json").save("cars.json", mode="overwrite")
Generated files:
{"PrimaryId":"100","FirstName":"John","LastName":"Smith","CarName":["Toyota","BMW"],"DogName":["Spike","Rusty"]}
{"PrimaryId":"101","FirstName":"Ben","LastName":"Swan","CarName":["Ford","Volkswagen","Audi"],"DogName":["Max","Buddy"]}
{"PrimaryId":"102","FirstName":"Julia","LastName":"Brown","CarName":["Mini"],"DogName":["Lucy"]}
Let me know if this is what you are looking for.

You can use pandas DataFrame.groupby() to group by the id and then iterate over the DataFrameGroupBy object, creating the objects and writing the files.
You need to install pandas into your virtualenv with $ pip install pandas.
# coding: utf-8
import json

import pandas as pd


def group_csv_columns(csv_file):
    df = pd.read_csv(csv_file)
    # one group per distinct PrimaryId
    group_frame = df.groupby('PrimaryId')
    for key, data_frame in group_frame:
        data = {}
        data[key] = [{
            "City": data_frame['City'].unique().tolist()[0],
            "FirstName": data_frame['FirstName'].unique().tolist()[0],
            "CarName": data_frame['CarName'].unique().tolist(),
            "DogName": data_frame['DogName'].unique().tolist(),
            "LastName": data_frame['LastName'].unique().tolist()[0],
        }]
        # write each group to its own file, e.g. Output_100.json
        file_name = 'Output_' + str(key) + '.json'
        with open(file_name, 'w') as fh:
            contents = json.dumps(data)
            fh.write(contents)


group_csv_columns('/tmp/sample.csv')
Call group_csv_columns() with the name of the file holding the CSV contents.
Check the pandas docs.

Related

Update existing JSON file

I'm trying to update an existing JSON file while running my code by adding additional data (package_id). This is the existing JSON content:
{
  "1": {
    "age": 10,
    "name": [
      "ramsi",
      "jack",
      "adem",
      "sara"
    ],
    "skills": []
  }
}
and I want to insert a new package so it looks like this:
{
  "1": {
    "age": 10,
    "name": [
      "ramsi",
      "jack",
      "adem",
      "sara"
    ],
    "skills": []
  },
  "2": {
    "age": 14,
    "name": [
      "maya",
      "raji"
    ],
    "skills": ["writing"]
  }
}
The issue is that when I add the new data, a second top-level object is appended, and two top-level values are not allowed by the JSON standard:
{"1": {
"age": 10,
"name": [
"ramsi",
"jack",
"adem",
"sara",
],
"skills": []
}} {"2": {
"age": 14,
"name": [
"maya",
"raji",
],
"skills": ["writing"]
}
}
And this is my code to add the new package_id:
list1[package_id] = {"age": x, "name": y, "skills": z}
ss = json.dumps(list1, indent=2)
data = []
with open('file.json', 'r+') as f:
    data = json.loads(f.read())
    data1 = json.dumps(data, indent=2)
    f.seek(0)
    f.write(data1)
    f.write(ss)
    f.truncate()
I write to the file twice because if I don't store the existing contents and write them back, the old data is removed and only package_id number 2 is kept.
It doesn't work that way. You can't extend a JSON record by appending another JSON record after it. A JSON file always has exactly one top-level value. You need to modify that object.
with open('file.json', 'r') as f:
    data = json.loads(f.read())

data[package_id] = {'age': x, 'name': y, 'skills': z}

with open('file.json', 'w') as f:
    f.write(json.dumps(data, indent=2))

Retrieve data from json file using python

I'm new to Python. I'm running Python on Azure Databricks. I have a .json file; I'm putting the important fields of the JSON file here:
{
  "school": [
    {
      "schoolid": "mr1",
      "board": "cbse",
      "principal": "akseal",
      "schoolName": "dps",
      "schoolCategory": "UNKNOWN",
      "schoolType": "UNKNOWN",
      "city": "mumbai",
      "sixhour": true,
      "weighting": 3,
      "paymentMethods": [
        "cash",
        "cheque"
      ],
      "contactDetails": [
        {
          "name": "picsa",
          "type": "studentactivities",
          "information": [
            {
              "type": "PHONE",
              "detail": "+917597980"
            }
          ]
        }
      ],
      "addressLocations": [
        {
          "locationType": "School",
          "address": {
            "countryCode": "IN",
            "city": "Mumbai",
            "zipCode": "400061",
            "street": "Madh",
            "buildingNumber": "80"
          },
          "Location": {
            "latitude": 49.313885,
            "longitude": 72.877426
          },
I need to create a data frame with schoolName as one column and latitude and longitude as two other columns. Can you please suggest how to do that?
You can use the method json.load(); here's an example:
import json

with open('path_to_file/file.json') as f:
    data = json.load(f)

print(data)
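From there, a minimal sketch of building the requested data frame with pandas, assuming the structure shown above (the loop over addressLocations handles schools with several locations):
import json

import pandas as pd

with open('path_to_file/file.json') as f:
    data = json.load(f)

# one row per school/location pair, keeping only the three wanted fields
rows = [
    {
        "schoolName": school["schoolName"],
        "latitude": loc["Location"]["latitude"],
        "longitude": loc["Location"]["longitude"],
    }
    for school in data["school"]
    for loc in school["addressLocations"]
]
df = pd.DataFrame(rows)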
Use this:
import json  # built-in

with open("filename.json", 'r') as jsonFile:
    Data = json.load(jsonFile)  # file objects have no .load(); pass the file to json.load()
Data is now a dictionary of the contents. For example:
for i in Data:
    # loops through the keys
    print(Data[i])  # prints the value
For more on JSON:
https://docs.python.org/3/library/json.html
and python dictionaries:
https://www.programiz.com/python-programming/dictionary

How to get only required columns in Python script while parsing the data from Json File

I am trying to write a Python script. As per the requirement, I have around 400 columns coming from multiple arrays in a JSON file.
I am using the pandas library and Python version 3.6. I may get more than 400 columns from the JSON file. How can I restrict the unwanted columns and get only the specified columns in my Python output file?
I am using the code below to get the data as per the specified columns.
Issue: in my output file I am also getting the rest of the columns, beyond those mentioned in the column list file. How can I restrict the unwanted columns and get only the required columns in my output?
with open('Columns.txt') as c:
    columns_list = c.readlines()

with open('JsonFile.json') as f:
    json_file = json.load(f)

df = pd.DataFrame(columns=columns_list)
And I have one more scenario. Currently I have data like the sample below.
In 70% of cases I have data like [attributes][ABC][values][value], and in the remaining cases I have [attributes][Xdfghgjgjgj][grp] (here I have some 2 records inside). Can you help me with a solution to handle this type of multi-valued attribute scenario?
{
  "entities": [
    {
      "id": "XXXXXXXXXXXXXXX",
      "data": {
        "attributes": {
          "ABC": {
            "values": [
              {
                "value": "00000000000000"
              }
            ]
          },
          "Xdfghgjgjgj": {
            "grp": [
              {
                "SUPP": {
                  "values": [
                    {
                      "value": "000000000000000000"
                    }
                  ]
                },
                "yfyfyfyfyfy": {
                  "values": [
                    {
                      "value": "909000090099090"
                    }
                  ]
                }
              },
              {
                "SUPP": {
                  "values": [
                    {
                      "value": "000000000000000000"
                    }
                  ]
                },
                "yfyfyfyfyfy": {
                  "values": [
                    {
                      "value": "909000090099090"
                    }
                  ]
                }
              }
            ]
          }
        }
      }
    }
  ]
}
There is a way to read specific columns from a CSV using pandas:
import pandas as pd

cols = ['col1', 'col2', 'col3']
df = pd.read_csv('JsonFile.csv', skipinitialspace=True, usecols=cols)

# save to output
df.to_csv('output.csv', index=False)
Or you could specify the columns when you are saving your file:
df = pd.read_csv('JsonFile.csv')
df[column_names].to_csv('output.csv',index=False)
Edit:
import json

import pandas as pd

with open('Columns.txt') as c:
    # strip the trailing newlines readlines() keeps, or the column lookup below fails
    columns_list = [line.strip() for line in c.readlines()]

with open('JsonFile.json') as f:
    json_file = json.load(f)

# df = pd.DataFrame.from_dict(json_file, orient='columns')
df = pd.DataFrame(json_file)
df[columns_list].to_csv('output.csv', index=False)
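pd.DataFrame(json_file) will not flatten the nested [grp] records by itself, so for the multi-valued scenario a small hand-rolled loop is safer. A minimal sketch, assuming the structure in your sample above; the SUPP_0, SUPP_1, ... column-name scheme is only an illustrative choice:
import json

import pandas as pd

with open('JsonFile.json') as f:
    data = json.load(f)

rows = []
for entity in data['entities']:
    attrs = entity['data']['attributes']
    row = {'id': entity['id']}
    # simple case: [attributes][ABC][values][value]
    row['ABC'] = attrs['ABC']['values'][0]['value']
    # multi-valued case: [attributes][Xdfghgjgjgj][grp] holds several records
    for n, grp in enumerate(attrs['Xdfghgjgjgj']['grp']):
        for name, attr in grp.items():
            row[f'{name}_{n}'] = attr['values'][0]['value']
    rows.append(row)

df = pd.DataFrame(rows)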

How to convert JSON file data into a CSV file in Python

I have a .json file which has this type of data, the names of universities of the world:
[
  {
    "web_pages": [
      "https://www.cstj.qc.ca",
      "https://ccmt.cstj.qc.ca",
      "https://ccml.cstj.qc.ca"
    ],
    "name": "Cégep de Saint-Jérôme",
    "alpha_two_code": "CA",
    "state-province": null,
    "domains": [
      "cstj.qc.ca"
    ],
    "country": "Canada"
  },
  {
    "web_pages": [
      "http://www.lindenwood.edu/"
    ],
    "name": "Lindenwood University",
    "alpha_two_code": "US",
    "state-province": null,
    "domains": [
      "lindenwood.edu"
    ],
    "country": "United States"
  },
  {
    "web_pages": [
      ... (continues)
I want to convert this .json file into a CSV file using Python. What would be the solution for making the CSV file?
This solution uses Pandas.
import json

import pandas as pd

with open('infile.json') as json_data:
    d = json.load(json_data)

# flattens each record into one row; on old pandas versions this
# was `from pandas.io.json import json_normalize`
df = pd.json_normalize(d)
df.to_csv('outfile.csv', index=False)
Also, as @LucaBezerra has mentioned in the comments, the current text has some encoding problems which you might want to fix (look at the first "name").

Update a specific key in JSON Array using PYTHON

I have a JSON file which has some key-value pairs in arrays. I need to update/replace the value for the key id with a value stored in a variable called var1.
The problem is that when I run my Python code, it adds the new key-value pair outside the inner array instead of replacing the existing value:
PYTHON SCRIPT:
import json
import sys

var1 = "abcdefghi"

with open('C:\\Projects\\scripts\\input.json', 'r+') as f:
    json_data = json.load(f)
    json_data['id'] = var1
    f.seek(0)
    f.write(json.dumps(json_data))
    f.truncate()
INPUT JSON:
{
  "channel": "AT",
  "username": "Maintenance",
  "attachments": [
    {
      "fallback": "[Urgent]:",
      "pretext": "[Urgent]:",
      "color": "#D04000",
      "fields": [
        {
          "title": "SERVERS:",
          "id": "popeye",
          "short": false
        }
      ]
    }
  ]
}
OUTPUT:
{
  "username": "Maintenance",
  "attachments": [
    {
      "color": "#D04000",
      "pretext": "[Urgent]:",
      "fallback": "[Urgent]:",
      "fields": [
        {
          "short": false,
          "id": "popeye",
          "title": "SERVERS:"
        }
      ]
    }
  ],
  "channel": "AT",
  "id": "abcdefghi"
}
The line below will update the id inside fields:
json_data['attachments'][0]['fields'][0]['id'] = var1
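Folded back into the question's script, a minimal sketch (switching to json.dump with indent=2 is an optional cosmetic choice):
import json

var1 = "abcdefghi"

with open('C:\\Projects\\scripts\\input.json', 'r+') as f:
    json_data = json.load(f)
    # target the id nested inside attachments -> fields, not the top level
    json_data['attachments'][0]['fields'][0]['id'] = var1
    f.seek(0)
    json.dump(json_data, f, indent=2)
    f.truncate()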
