How to create dataframe from nested JSON? - python

I'm trying to create a dataframe using the following JSON structure -
{
  "tables": {
    "name": "PrimaryResult",
    "columns": [
      { "name": "EmployeeID", "type": "Int" },
      { "name": "EmployeeName", "type": "String" },
      { "name": "DepartmentName", "type": "String" }
    ],
    "rows": [
      [123, "John Doe", "IT"],
      [234, "Jane Doe", "HR"]
    ]
  }
}
I tried a few of the suggestions from How to create pandas DataFrame from nested Json with list and How to parse nested JSON objects in spark sql?.
But I'm still confused. Essentially the output should look somewhat like below -
+----------+------------+--------------+
|EmployeeId|EmployeeName|DepartmentName|
+----------+------------+--------------+
| 123| John Doe| IT|
| 234| Jane Doe| HR|
+----------+------------+--------------+
I'm trying to refrain from using pandas as it shows a lot of memory issues if the data is huge (not sure if there is a way to handle them).
Please help.

See the logic below -
import json
# js is the JSON string shown in the question
data = [json.loads(js)]
print(data)
# Output
[{'tables': {'name': 'PrimaryResult', 'columns': [{'name': 'EmployeeID', 'type': 'Int'}, {'name': 'EmployeeName', 'type': 'String'}, {'name': 'DepartmentName', 'type': 'String'}], 'rows': [[123, 'John Doe', 'IT'], [234, 'Jane Doe', 'HR']]}}]
Now fetch the columns as below -
columns = []
for i in range(len(data[0]['tables']['columns'])):
    columns.append(data[0]['tables']['columns'][i]['name'])
print(columns)
#Output
['EmployeeID', 'EmployeeName', 'DepartmentName']
Create a dictionary of columns and rows as below -
dict_JSON = {}
dict_JSON["columns"] = columns
dict_JSON["data"] = data[0]['tables']['rows']
print(dict_JSON)
#Output
{'columns': ['EmployeeID', 'EmployeeName', 'DepartmentName'], 'data': [[123, 'John Doe', 'IT'], [234, 'Jane Doe', 'HR']]}
Now once you have this dictionary, create a pandas dataframe and from there create the Spark dataframe as below -
import pandas as pd
pdf = pd.read_json(json.dumps(dict_JSON), orient='split')
df = spark.createDataFrame(pdf)
df.show()
+----------+------------+--------------+
|EmployeeID|EmployeeName|DepartmentName|
+----------+------------+--------------+
| 123| John Doe| IT|
| 234| Jane Doe| HR|
+----------+------------+--------------+
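Since the question mentions wanting to avoid pandas, here is a minimal sketch that skips pandas entirely (assuming js holds the JSON string from the question and spark is an existing SparkSession, as in the code above):
import json

# Parse the JSON string and pull out the table metadata and rows
tables = json.loads(js)['tables']
columns = [c['name'] for c in tables['columns']]
rows = tables['rows']

# Spark infers the column types from the values; no intermediate pandas DataFrame is needed
df = spark.createDataFrame(rows, schema=columns)
df.show()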

Related

Convert parquet to list of objects in python

I am reading a parquet file with panda:
import pandas as pd
df = pd.read_parquet('myfile.parquet', engine='pyarrow')
The file has the following structure:
   company_id  user_id           attribute_name  attribute_value  timestamp
1  116664      111f07000612      first_name      Tom              2022-03-23 17:11:58
2  116664      111f07000612      last_name       Cruise           2022-03-23 17:11:58
3  116664      111f07000612      city            New York         2022-03-23 17:11:58
4  116664      abcf0700d009d122  first_name      Matt             2022-02-23 10:11:59
5  116664      abcf0700d009d122  last_name       Damon            2022-02-23 10:11:59
I would like to group by user_id and generate a list of objects (that will be stored as json) with the following format:
[
{
"user_id": "111f07000612",
"first_name": "Tom",
"last_name": "Cruise",
"city": "New York"
},
{
"user_id": "abcf0700d009d122",
"first_name": "Matt",
"last_name": "Damon"
}
]
Hi 👋🏻 Hope you are doing well!
You can achieve it with something similar to this 🙂
from pprint import pprint
import pandas as pd
# because I don't have the exact parquet file, I will just mock it
# df = pd.read_parquet("myfile.parquet", engine="pyarrow")
df = pd.DataFrame(
{
"company_id": [116664, 116664, 116664, 116664, 116664],
"user_id": ["111f07000612", "111f07000612", "111f07000612", "abcf0700d009d122", "abcf0700d009d122"],
"attribute_name": ["first_name", "last_name", "city", "first_name", "last_name"],
"attribute_value": ["Tom", "Cruise", "New York", "Matt", "Damon"],
"timestamp": ["2022-03-23 17:11:58", "2022-03-23 17:11:58", "2022-03-23 17:11:58", "2022-03-23 17:11:58", "2022-03-23 17:11:58"]
}
)
records = []
for user_id, group in df.groupby("user_id"):
    transformed_group = (
        group[["attribute_name", "attribute_value"]]
        .set_index("attribute_name")
        .transpose()
        .assign(user_id=user_id)
    )
    record, *_ = transformed_group.to_dict("records")
    records.append(record)
pprint(records)
# [{'city': 'New York',
# 'first_name': 'Tom',
# 'last_name': 'Cruise',
# 'user_id': '111f07000612'},
# {'first_name': 'Matt', 'last_name': 'Damon', 'user_id': 'abcf0700d009d122'}]
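A possible alternative is to reshape with pivot instead of groupby plus transpose; this is just a sketch and assumes there is at most one attribute_value per (user_id, attribute_name) pair, as in the mock above:
import json
import pandas as pd

# Pivot attribute_name into columns, one row per user_id
wide = (
    df.pivot(index="user_id", columns="attribute_name", values="attribute_value")
      .reset_index()
)

# Drop NaN entries per row so users without a 'city' simply omit the key
records = [
    {k: v for k, v in row.items() if pd.notna(v)}
    for row in wide.to_dict("records")
]
print(json.dumps(records, indent=2))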

Json file to pandas data frame

I have a JSON file look like below.
myjson= {'data': [{'ID': 'da45e00ca',
'name': 'June_2016',
'objCode': 'ased',
'percentComplete': 4.17,
'plannedCompletionDate': '2021-04-29T10:00:00:000-0500',
'plannedStartDate': '2020-04-16T23:00:00:000-0500',
'priority': 4,
'asedectedCompletionDate': '2022-02-09T10:00:00:000-0600',
'status': 'weds'},
{'ID': '10041ce23c',
'name': '2017_Always',
'objCode': 'ased',
'percentComplete': 4.17,
'plannedCompletionDate': '2021-10-22T10:00:00:000-0600',
'plannedStartDate': '2021-08-09T23:00:00:000-0600',
'priority': 3,
'asedectedCompletionDate': '2023-12-30T11:05:00:000-0600',
'status': 'weds'},
{'ID': '10041ce23ca',
'name': '2017_Always',
'objCode': 'ased',
'percentComplete': 4.17,
'plannedCompletionDate': '2021-10-22T10:00:00:000-0600',
'plannedStartDate': '2021-08-09T23:00:00:000-0600',
'priority': 3,
'asedectedCompletionDate': '2023-12-30T11:05:00:000-0600',
'status': 'weds'}]}
I was trying to normalize it and convert it to a pandas DataFrame using the code below, but the result doesn't come out correct.
from pandas.io.json import json_normalize
reff = json_normalize(myjson)
df = pd.DataFrame(data=reff)
df
Does anyone have an idea what I'm doing wrong? Thanks in advance!
Try:
import pandas as pd
reff = pd.json_normalize(myjson['data'])
df = pd.DataFrame(data=reff)
df
You forgot to pull your data out of myjson. json_normalize() will iterate through the outermost layer of your JSON.
This method first normalizes the JSON data and then converts it into a pandas dataframe. You have to import this method from the pandas module.
Step 1 - Load the json data
json.loads(json_string)
Step 2 - Pass the loaded data into the json_normalize() method
json_normalize(json.loads(json_string))
Example:
import pandas as pd
import json
# Create json string
# with student details
json_string = '''
[
{ "id": "1", "name": "sravan","age":22 },
{ "id": "2", "name": "harsha","age":22 },
{ "id": "3", "name": "deepika","age":21 },
{ "id": "4", "name": "jyothika","age":23 }
]
'''
# Load json data and convert to Dataframe
df = pd.json_normalize(json.loads(json_string))
# Display the Dataframe
print(df)
Output:
id name age
0 1 sravan 22
1 2 harsha 22
2 3 deepika 21
3 4 jyothika 23
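For nested JSON like the myjson example at the top of this question, json_normalize also accepts a record_path argument pointing at the list to flatten. A small sketch (the nested dict here is made up for illustration):
import pandas as pd

nested = {
    "data": [
        {"ID": "da45e00ca", "name": "June_2016", "status": "weds"},
        {"ID": "10041ce23c", "name": "2017_Always", "status": "weds"},
    ]
}

# record_path tells json_normalize which list of records to flatten
df = pd.json_normalize(nested, record_path="data")
print(df)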

Generalize algorithm for a loop comparing to last record?

I have a data set which I can represent by this toy example of a list of dictionaries:
data = [{
"_id" : "001",
"Location" : "NY",
"start_date" : "2022-01-01T00:00:00Z",
"Foo" : "fruits"
},
{
"_id" : "002",
"Location" : "NY",
"start_date" : "2022-01-02T00:00:00Z",
"Foo" : "fruits"
},
{
"_id" : "011",
"Location" : "NY",
"start_date" : "2022-02-01T00:00:00Z",
"Bar" : "vegetables"
},
{
"_id" : "012",
"Location" : "NY",
"Start_Date" : "2022-02-02T00:00:00Z",
"Bar" : "vegetables"
},
{
"_id" : "101",
"Location" : "NY",
"Start_Date" : "2022-03-01T00:00:00Z",
"Baz" : "pizza"
},
{
"_id" : "102",
"Location" : "NY",
"Start_Date" : "2022-03-2T00:00:00Z",
"Baz" : "pizza"
},
]
Here is an algorithm in Python which collects the keys of each 'collection' and, whenever there is a key change, adds those keys to the output.
data_keys = []
for i, lst in enumerate(data):
    all_keys = []
    for k, v in lst.items():
        all_keys.append(k)
        if k.lower() == 'start_date':
            start_date = v
    this_coll = {'start_date': start_date, 'all_keys': all_keys}
    if i == 0:
        data_keys.append(this_coll)
    else:
        last_coll = data_keys[-1]
        if this_coll['all_keys'] == last_coll['all_keys']:
            continue
        else:
            data_keys.append(this_coll)
The correct output given here records each change of field name (Foo, Bar, Baz) as well as the change of case in the start_date field to Start_Date:
[{'start_date': '2022-01-01T00:00:00Z',
'all_keys': ['_id', 'Location', 'start_date', 'Foo']},
{'start_date': '2022-02-01T00:00:00Z',
'all_keys': ['_id', 'Location', 'start_date', 'Bar']},
{'start_date': '2022-02-02T00:00:00Z',
'all_keys': ['_id', 'Location', 'Start_Date', 'Bar']},
{'start_date': '2022-03-01T00:00:00Z',
'all_keys': ['_id', 'Location', 'Start_Date', 'Baz']}]
Is there a general algorithm which covers this pattern comparing current to previous item in a stack?
I need to generalize this algorithm and find a solution to do exactly the same thing with MongoDB documents in a collection. In order for me to discover if Mongo has an Aggregation Pipeline Operator which I could use, I must first understand if this basic algorithm has other common forms so I know what to look for.
Or someone who knows MongoDB aggregation pipelines really well could suggest operators which would produce the desired result?
EDIT: If you want to use a query for this, one option is something like the pipeline below.
$objectToArray formats the keys as values, and $ifNull checks both spellings of start_date.
$unwind allows us to sort the keys.
$group undoes the $unwind, but now with sorted keys.
$reduce creates a string from all the keys, so we have something to compare.
$group again, but now on that string, so only documents where the keys changed remain.
db.collection.aggregate([
{
$project: {
data: {$objectToArray: "$$ROOT"},
start_date: {$ifNull: ["$start_date", "$Start_Date"]}
}
},
{$unwind: "$data"},
{$project: {start_date: 1, key: "$data.k", _id: 0}},
{$sort: {start_date: 1, key: 1}},
{$group: {_id: "$start_date", all_keys: {$push: "$key"}}},
{
$project: {
all_keys: 1,
all_keys_string: {
$reduce: {
input: "$all_keys",
initialValue: "",
in: {$concat: ["$$value", "$$this"]}
}
}
}
},
{
$group: {
_id: "$all_keys_string",
all_keys: {$first: "$all_keys"},
start_date: {$first: "$_id"}
}
},
{$unset: "_id"}
])
Playground example
itertools.groupby starts a new subiterator each time the key value changes, so it does the work of tracking a changing key for you. In your case the key is the set of keys of each dictionary. You can write a list comprehension that takes the first value from each of these subiterators.
import itertools
data = ... your data ...
data_keys = [next(val)
             for _, val in itertools.groupby(data, lambda record: record.keys())]
for row in data_keys:
    print(row)
Result
{'_id': '001', 'Location': 'NY', 'start_date': '2022-01-01T00:00:00Z', 'Foo': 'fruits'}
{'_id': '011', 'Location': 'NY', 'start_date': '2022-02-01T00:00:00Z', 'Bar': 'vegetables'}
{'_id': '012', 'Location': 'NY', 'Start_Date': '2022-02-02T00:00:00Z', 'Bar': 'vegetables'}
{'_id': '101', 'Location': 'NY', 'Start_Date': '2022-03-01T00:00:00Z', 'Baz': 'pizza'}
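If you need the exact {'start_date': ..., 'all_keys': ...} shape from the question rather than the full first record, a small reshaping sketch on top of the same groupby (assuming every record carries either start_date or Start_Date) could be:
import itertools

data_keys = []
for _, group in itertools.groupby(data, lambda record: record.keys()):
    first = next(group)
    # Accept either spelling of the date field
    start_date = first.get('start_date', first.get('Start_Date'))
    data_keys.append({'start_date': start_date, 'all_keys': list(first.keys())})

for row in data_keys:
    print(row)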

How to transpose JSON structs and arrays in PySpark

I have the following Json file that I'm reading into a dataframe.
{
  "details": {
    "box": [
      { "Touchdowns": "123", "field": "Texans" },
      { "Touchdowns": "456", "field": "Ravens" }
    ]
  },
  "name": "Team"
}
How could I manipulate this to get the following output?
Team    Touchdowns
Texans  123
Ravens  456
I'm struggling a bit with whether I need to pivot/transpose the data or if there is a more elegant approach.
Read the multiline json into spark
df = spark.read.json('/path/to/scores.json',multiLine=True)
Schema
df:pyspark.sql.dataframe.DataFrame
  details:struct
    box:array
      element:struct
        Touchdowns:string
        field:string
  name:string
All of the info you want is in the first row, so get that and drill down to details and box and make that your new dataframe.
spark.createDataFrame(df.first()['details']['box']).withColumnRenamed('field','Team').show()
Output
+----------+------+
|Touchdowns| Team|
+----------+------+
| 123|Texans|
| 456|Ravens|
+----------+------+
You can use the inline function.
df = spark.read.load(json_file_path, format='json', multiLine=True)
df = df.selectExpr('inline(details.box)').withColumnRenamed('field', 'Team')
df.show(truncate=False)
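With the sample file above, this should produce roughly the same result as the previous answer:
+----------+------+
|Touchdowns|Team  |
+----------+------+
|123       |Texans|
|456       |Ravens|
+----------+------+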
You can try using an RDD to get the values of the box list.
Input JSON
jsonstr="""{
"details": {
"box": [
{
"Touchdowns": "123",
"field": "Texans"
},
{
"Touchdowns": "456",
"field": "Ravens"
}
]
},
"name": "Team"
}"""
Now convert it to an RDD using the keys of the dictionary as below -
import json
box_rdd = sc.parallelize(json.loads(jsonstr)['details']['box'])
box_rdd.collect()
Output - [{'Touchdowns': '123', 'field': 'Texans'},
{'Touchdowns': '456', 'field': 'Ravens'}]
Finally create the dataframe with this box_rdd as below -
from pyspark.sql.types import *
schema = StructType([StructField('Touchdowns', StringType(), True), StructField('field', StringType(), True)])
df = spark.createDataFrame(data=box_rdd,schema=schema)
df.show()
+----------+------+
|Touchdowns| field|
+----------+------+
| 123|Texans|
| 456|Ravens|
+----------+------+
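Another possible approach, as a rough sketch, is to explode the box array with the DataFrame API instead of dropping to an RDD, again assuming the file was read with multiLine=True:
from pyspark.sql import functions as F

df = spark.read.json('/path/to/scores.json', multiLine=True)

# explode turns each element of the box array into its own row,
# then the struct fields are pulled out as columns
result = (
    df.select(F.explode('details.box').alias('box'))
      .select(F.col('box.field').alias('Team'), F.col('box.Touchdowns').alias('Touchdowns'))
)
result.show()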

Define specific json export format of pandas dataframe

I need to export my DF into a specific JSON format, but I'm struggling to format it in the right way.
I'd like to create a subsection with shop_details that shows the city and location for the shop if they are known; otherwise it should be left empty.
Code for my DF:
from pandas import DataFrame
Data = {'item_type': ['Iphone','Computer','Computer'],
'purch_price': [1200,700,700],
'sale_price': [1150,'NaN','NaN'],
'city': ['NaN','Los Angeles','San Jose'],
'location': ['NaN','1st street', '2nd street']
}
DF looks like this:
item_type purch_price sale_price city location
0 Iphone 1200 1150 NaN NaN
1 Computer 700 NaN Los Angeles 1st street
2 Computer 700 NaN San Jose 2nd street
The output format should look like below:
[{
"item_type": "Iphone",
"purch_price": "1200",
"sale_price": "1150",
"shop_details": []
},
{
"item_type": "Computer",
"purch_price": "700",
"sale_price": "600",
"shop_details": [{
"city": "Los Angeles",
"location": "1st street"
},
{
"city": "San Jose",
"location": "2nd street"
}
]
}
]
import json
from pandas import DataFrame

# Build the DataFrame from the Data dict in the question
df = DataFrame(Data)
# The sample data stores missing values as the string 'NaN', so replace
# those (as well as any real NaN values) with empty strings
df = df.replace('NaN', '').fillna('')

def shop_details(row):
    if row['city'] != '' and row['location'] != '':
        return [{'city': row['city'], 'location': row['location']}]
    else:
        return []

df['shop_details'] = df.apply(lambda row: shop_details(row), axis=1)
df = df.drop(['city', 'location'], axis=1)
json.dumps(df.to_dict('records'))
The only problem is that this does not group by item_type, but you should do some of the work ;)
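A possible grouping step, as a rough sketch on top of the code above (it keeps the first row's prices per item_type and merges the shop_details lists), could look like this:
import json

grouped = []
for item_type, g in df.groupby('item_type', sort=False):
    # Flatten the per-row shop_details lists into one list per item_type
    shop_details = [d for details in g['shop_details'] for d in details]
    grouped.append({
        'item_type': item_type,
        'purch_price': str(g['purch_price'].iloc[0]),
        'sale_price': str(g['sale_price'].iloc[0]),
        'shop_details': shop_details,
    })
print(json.dumps(grouped, indent=2))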
You can do it like below to export your dataframe to a JSON file. Thanks
from pandas import DataFrame
Data = {'item_type': ['Iphone','Computer','Computer'],
'purch_price': [1200,700,700],
'sale_price': [1150,'NaN','NaN'],
'city': ['NaN','Los Angeles','San Jose'],
'location': ['NaN','1st street', '2nd street']
}
df = DataFrame(Data, columns=['item_type', 'purch_price', 'sale_price', 'city', 'location'])
Export = df.to_json('path where you want to export your json file')
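If the goal is the list-of-objects layout shown in the question, pandas' orient='records' option gets closer; a small sketch ('output.json' is a placeholder path, and the nested shop_details still needs the grouping logic from the first answer):
# orient='records' writes one JSON object per row, matching the list-of-objects shape
df.to_json('output.json', orient='records', indent=2)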
