How to transpose JSON structs and arrays in PySpark - python

I have the following JSON file that I'm reading into a dataframe.
{
  "details": {
    "box": [
      {
        "Touchdowns": "123",
        "field": "Texans"
      },
      {
        "Touchdowns": "456",
        "field": "Ravens"
      }
    ]
  },
  "name": "Team"
}
How could I manipulate this to get the following output?
Team    Touchdowns
Texans  123
Ravens  456
I'm struggling a bit with whether I need to pivot/transpose the data or if there is a more elegant approach.

Read the multiline json into spark
df = spark.read.json('/path/to/scores.json',multiLine=True)
Schema
df: pyspark.sql.dataframe.DataFrame
  details: struct
    box: array
      element: struct
        Touchdowns: string
        field: string
  name: string
All of the info you want is in the first row, so get that and drill down to details and box and make that your new dataframe.
spark.createDataFrame(df.first()['details']['box']).withColumnRenamed('field','Team').show()
Output
+----------+------+
|Touchdowns| Team|
+----------+------+
| 123|Texans|
| 456|Ravens|
+----------+------+
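If you'd rather avoid collecting the row to the driver, a minimal sketch of an equivalent approach that stays distributed is to explode the array column (assuming the same df as above):
from pyspark.sql import functions as F
out = (df.select(F.explode('details.box').alias('box'))
         .select(F.col('box.Touchdowns').alias('Touchdowns'),
                 F.col('box.field').alias('Team')))
out.show()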

You can use the inline function.
df = spark.read.load(json_file_path, format='json', multiLine=True)
df = df.selectExpr('inline(details.box)').withColumnRenamed('field', 'Team')
df.show(truncate=False)
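With the sample file this should give the same two-row result:
+----------+------+
|Touchdowns|Team  |
+----------+------+
|123       |Texans|
|456       |Ravens|
+----------+------+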

You can try using an RDD to get the values of the box list.
Input JSON
jsonstr = """{
  "details": {
    "box": [
      {
        "Touchdowns": "123",
        "field": "Texans"
      },
      {
        "Touchdowns": "456",
        "field": "Ravens"
      }
    ]
  },
  "name": "Team"
}"""
Now convert it to an RDD using the keys of the dictionary as below -
import json
box_rdd = sc.parallelize(json.loads(jsonstr)['details']['box'])
box_rdd.collect()
Output -
[{'Touchdowns': '123', 'field': 'Texans'},
 {'Touchdowns': '456', 'field': 'Ravens'}]
Finally create the dataframe with this box_rdd as below -
from pyspark.sql.types import *
schema = StructType([StructField('Touchdowns', StringType(), True), StructField('field', StringType(), True)])
df = spark.createDataFrame(data=box_rdd,schema=schema)
df.show()
+----------+------+
|Touchdowns| field|
+----------+------+
| 123|Texans|
| 456|Ravens|
+----------+------+
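To match the desired output in the question, rename the field column here as well:
df.withColumnRenamed('field', 'Team').show()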

Related

By using PySpark how to parse nested JSON (Blob format)

I'm getting the following records in blob format, separated by newlines. Below is an example of two events separated by a newline.
A few things to note here:
In the example below, the event structure is inconsistent. For certain events I will get Channel ID, Conversation ID, replyActivity ID, From ID, and locale columns, and for absent columns I need to populate null in my data frame.
How can I achieve this in PySpark?
Example:
{
"event":[
{
"name":"Zip/Postal Code",
"count":1
}
],
"internal":{
"data":{
"id":"XXXX",
"documentVersion":"1.61"
}
},
"context":{
"application":{
"version":"Thu 10/15/2020 2:46:54.65 \r\nUTC (fv-az464-530) [Build 174613] [Repo Intercom] [Branch prod] [Commit XXXX] \r\n[IntercomWebUIVersion 1.6.20-169031] [IntercomBotAppTemplatesVersion 1.3.27-165664] \r\n"
},
"data":{
"eventTime":"2020-10-20T15:54:48.7734934Z",
"isSynthetic":false,
"samplingRate":100.0
},
"cloud":{
},
"device":{
"type":"PC",
"roleName":"bc-directline-eus2",
"roleInstance":"RD0004FFA145F5",
"screenResolution":{
}
},
"session":{
"isFirst":false
},
"operation":{
"id":"f115c4bf-4fa31385d9a8f248",
"parentId":"|f115c4bf-4fa31385d9a8f248."
},
"location":{
"clientip":"0.0.0.0",
"continent":"North America",
"country":"United States",
"province":"Virginia",
"city":"Boydton"
},
"custom":{
"dimensions":[
{
"Timestamp":"XXXX"
},
{
"StatusCode":"200"
},
{
"Activity ID":"HR48uEYXuCE1yIsFMLL3X3-j|0000006"
},
{
"From ID":"XXXX"
},
{
"Correlation ID":"|f115c4bf-4fa31385d9a8f248."
},
{
"Channel ID":"directline"
},
{
"Recipient ID":"7222C-RG-CAR-MP5-HVC-Chatbot-P-p7rpums#Ye6TP1LJz0o"
},
{
"Bot ID":"XXXX"
},
{
"Activity Type":"message"
},
{
"Conversation ID":"HR48uEYXuCE1yIsFMLL3X3-j"
}
]
}
}
}{
"event":[
{
"name":"Activity",
"count":1
}
],
"internal":{
"data":{
"id":"992b0fc7-12ec-11eb-b59a-fb2df7d234d8",
"documentVersion":"1.61"
}
},
"context":{
"application":{
"version":"Thu 10/15/2020 2:46:54.65 \r\nUTC (fv-az464-530) [Build 174613] [Repo Intercom] [Branch prod] [Commit XXXX] \r\n[IntercomWebUIVersion 1.6.20-169031] [IntercomBotAppTemplatesVersion 1.3.27-165664] \r\n"
},
"data":{
"eventTime":"2020-10-20T15:54:34.3811795Z",
"isSynthetic":false,
"samplingRate":100.0
},
"cloud":{
},
"device":{
"type":"PC",
"roleName":"bc-directline-eastus3",
"roleInstance":"RD00155D33F838",
"screenResolution":{
}
},
"session":{
"isFirst":false
},
"operation":{
"id":"00-508c4cceaa6d954599230123d012265b-5f1d891b61135340-00",
"parentId":"|00-508c4cceaa6d954599230123d012265b-5f1d891b61135340-00.2fac18fc_"
},
"location":{
"clientip":"0.0.0.0",
"continent":"North America",
"country":"United States",
"province":"Virginia",
"city":"Washington"
},
"custom":{
"dimensions":[
{
"Timestamp":"XXXX"
},
{
"StatusCode":"200"
},
{
"Activity ID":"HR48uEYXuCE1yIsFMLL3X3-j|0000000"
},
{
"From ID":"XXXX"
},
{
"Correlation ID":"|00-508c4cceaa6d954599230123d012265b-5f1d891b61135340-00.2fac18fc_"
},
{
"Channel ID":"directline"
},
{
"Bot ID":"7222C-RG-CAR-MP5-HVC-Chatbot-P-p7rpums"
},
{
"Activity Type":"message"
},
{
"Conversation ID":"HR48uEYXuCE1yIsFMLL3X3-j"
}
]
}
}
}
I need to extract these records into the following table format (column names listed below):
ActivityId | ActivityType | ChannelId | conversationId | replyActivityId | fromId | locale | recipientId | speak | text | name |eventTime | Date | InstanceId | DialogId | StepName | applicationId | intent | intentScore | entities | question | sentimentLabel | sentimentScore | knowledgeBaseId | answer | articleFound | originalQuestion| question | questionId | score | username | city | province | country | Feedback | Comment | Tag
I took your sample data and created a json file. I read it in with Spark using this code:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.json('/tmp/data.json')
df.show()
and it gave me:
+--------------------+--------------------+--------------------+
| context| event| internal|
+--------------------+--------------------+--------------------+
|{{Thu 10/15/2020 ...|[{1, Zip/Postal C...| {{1.61, XXXX}}|
|{{Thu 10/15/2020 ...| [{1, Activity}]|{{1.61, 992b0fc7-...|
+--------------------+--------------------+--------------------+
The problem with this format was that I was losing metadata, so I changed the approach: load the JSON as a string column and parse it later. You can do this with:
df = spark.read.text('/tmp/data.json')
df.show()
which gives:
+--------------------+
| value|
+--------------------+
|{"event": [{"name...|
|{"event": [{"name...|
+--------------------+
From here, we can use a pandas UDF (or normal UDF) to process it. I will use the Fugue library as a way to easily convert Python and Pandas code to a Pandas UDF, but you can just turn the logic into a Pandas UDF later if you don't want to use Fugue.
Your final schema is very long so I think the concept will be clearer if I just use the first 3 columns. In this snippet I will extract:
ActivityId | ActivityType | ChannelId
In order to prototype, I will convert the original DataFrame to Pandas:
pdf = df.toPandas()
And then I will make a function that holds the logic. Some of this code may be repetitive and you might be able to simplify it with functions. I think this should be enough to illustrate the logic. One of the frustrating pieces was that it's tedious to pull some fields. There are Lists of Dicts that are a bit hard to access, but you can still get it to work.
import json
from typing import List, Dict, Any, Iterable

def process(df: List[Dict[str,Any]]) -> Iterable[Dict[str,Any]]:
    for row in df:
        record = json.loads(row["value"])  # parse the raw JSON string held in this row
        # Activity Id
        activity_id = record.get('context', {}).get('custom', {}).get('dimensions', [{}])
        activity_id = [x for x in activity_id if "Activity ID" in x.keys()]
        if len(activity_id) == 1:
            activity_id = activity_id[0]['Activity ID']
        else:
            activity_id = None
        # Activity Type
        activity_type = record.get('context', {}).get('custom', {}).get('dimensions', [{}])
        activity_type = [x for x in activity_type if "Activity Type" in x.keys()]
        if len(activity_type) == 1:
            activity_type = activity_type[0]['Activity Type']
        else:
            activity_type = None
        # Channel Id
        channel_id = record.get('context', {}).get('custom', {}).get('dimensions', [{}])
        channel_id = [x for x in channel_id if "Channel ID" in x.keys()]
        if len(channel_id) == 1:
            channel_id = channel_id[0]['Channel ID']
        else:
            channel_id = None
        yield {"ActivityId": activity_id,
               "ActivityType": activity_type,
               "ChannelId": channel_id,
               }
This function just parses each row's JSON string and then extracts the relevant fields. You might notice that the input and output types are not Pandas DataFrames. This is okay because Fugue can handle the conversion for us. To test this function, we can do:
import fugue.api as fa
schema = "ActivityId:str, ActivityType:str, ChannelId:str"
out = fa.transform(pdf, process, schema=schema)
# output is Pandas
out.head()
and this will adapt the process function to run on Pandas DataFrames. Schema is a requirement for Spark, so Fugue requires it as well. This gives us the following result:
ActivityId                        ActivityType  ChannelId
HR48uEYXuCE1yIsFMLL3X3-j|0000006  message       directline
HR48uEYXuCE1yIsFMLL3X3-j|0000000  message       directline
Now that we know it works on Pandas, we can bring it to Spark with the exact same command. We just need to pass in the Spark DataFrame instead.
out = fa.transform(df, process, schema=schema)
# returns a Spark DataFrame
out.show()
Under the hood, Fugue will convert each partition to a List[Dict[str,Any]] and then apply the process function. In this case, it is just applied on the default partitions of your DataFrame. The output annotation Iterable[Dict[str,Any]] tells Fugue how to bring the result back out to a Spark DataFrame.
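For reference, a minimal sketch of the same parsing without Fugue, using mapInPandas (assumes PySpark 3.x; the helper name parse_records and the three-column schema are just for illustration):
import json
import pandas as pd

def parse_records(batches):
    # each batch is a pandas DataFrame with the raw JSON string in the 'value' column
    for pdf in batches:
        rows = []
        for raw in pdf["value"]:
            record = json.loads(raw)
            dims = record.get("context", {}).get("custom", {}).get("dimensions", [])
            flat = {k: v for d in dims for k, v in d.items()}  # merge the list of single-key dicts
            rows.append({"ActivityId": flat.get("Activity ID"),
                         "ActivityType": flat.get("Activity Type"),
                         "ChannelId": flat.get("Channel ID")})
        yield pd.DataFrame(rows)

out = df.mapInPandas(parse_records, schema="ActivityId string, ActivityType string, ChannelId string")
out.show(truncate=False)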

Creating a table from an API response using PySpark

Response to one of my API calls is as below:
{
"requestId": "W2840866301623983629",
"items": [
{
"status": "SUCCESS",
"code": 0,
"apId": "amor:h973hw839sjw8933",
"data": {
"parentVis": 4836,
"parentmeet": 12,
"vis": 908921,
"out": 209481
}
},
{
"status": "SUCCESS",
"code": 0,
"apId": "complex:3d180779a7ea2b05f9a3c5c8",
"data": {
"parentVis": 5073,
"parentmeet": 9,
"vis": 623021,
"out": 168209
}
}
]
}
I'm trying to create a table as below:
+-----------+-------+-----------------------+---------------+-----------+-----------+-----------+
|status |code |apId |parentVis |parentmeet |vis |out |
+-----------+-------+-----------------------+---------------+-----------+-----------+-----------+
|SUCCESS |0 |amor:h973hw839sjw8933 |4836 |12 |908921 |209481 |
|SUCCESS |0 |amor:p0982hny23 |5073 |9 |623021 |168209 |
+-----------+-------+-----------------------+---------------+-----------+-----------+-----------+
I tried storing the API response as a string and using sc.parallelize, but I was unable to achieve the result.
Can someone please help me with the best approach?
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(data['items'])
df.show()
That should do it (assuming your JSON sits in a data dict).
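Note that this keeps data as a single nested column; to get the flat table in the question, the inner fields still need to be pulled out, e.g. (a sketch assuming the same four inner keys):
from pyspark.sql import functions as F
df = df.select('status', 'code', 'apId',
               *[F.col('data')[c].alias(c) for c in ['parentVis', 'parentmeet', 'vis', 'out']])
df.show(truncate=False)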
If your response is a string...
s = """{"requestId":"W2840866301623983629","items":[{"status":"SUCCESS","code":0,"apId":"amor:h973hw839sjw8933","data":{"parentVis":4836,"parentmeet":12,"vis":908921,"out":209481}},{"status":"SUCCESS","code":0,"apId":"complex:3d180779a7ea2b05f9a3c5c8","data":{"parentVis":5073,"parentmeet":9,"vis":623021,"out":168209}}]}"""
you can use the json library to parse the outer layer and then describe what's inside to extract that too.
from pyspark.sql import functions as F
import json
data_cols = ['parentVis', 'parentmeet', 'vis', 'out']
df = spark.createDataFrame(json.loads(s)['items']).select(
'status', 'code', 'apId',
*[F.col("data")[c].alias(c) for c in data_cols]
)
df.show(truncate=0)
# +-------+----+--------------------------------+---------+----------+------+------+
# |status |code|apId |parentVis|parentmeet|vis |out |
# +-------+----+--------------------------------+---------+----------+------+------+
# |SUCCESS|0 |amor:h973hw839sjw8933 |4836 |12 |908921|209481|
# |SUCCESS|0 |complex:3d180779a7ea2b05f9a3c5c8|5073 |9 |623021|168209|
# +-------+----+--------------------------------+---------+----------+------+------+
Assuming your API response is a dictionary, this should suffice
Store the API response in a dictionary:
dictionary = {
"requestId": "W2840866301623983629",
"items": [
{
"status": "SUCCESS",
"code": 0,
"apId": "amor:h973hw839sjw8933",
"data": {
"parentVis": 4836,
"parentmeet": 12,
"vis": 908921,
"out": 209481
}
},
{
"status": "SUCCESS",
"code": 0,
"apId": "complex:3d180779a7ea2b05f9a3c5c8",
"data": {
"parentVis": 5073,
"parentmeet": 9,
"vis": 623021,
"out": 168209
}
}
]
}
Create the schema based on the input API response, providing the data types:
from pyspark.sql.types import *

schema = StructType([
    StructField("status", StringType(), True),
    StructField("code", IntegerType(), True),
    StructField("apId", StringType(), True),
    StructField("data", StructType([
        StructField("parentVis", IntegerType(), True),
        StructField("parentmeet", IntegerType(), True),
        StructField("vis", IntegerType(), True),
        StructField("out", IntegerType(), True)
    ]), True),
])
Create a dataframe from the dictionary using the schema created above:
df = spark.createDataFrame(dictionary['items'],schema = schema)
Then create a new dataframe by selecting the nested columns as follows:
from pyspark.sql.functions import *
df1 = df.select(col('status'),col('code'),col('apId'),col('data.parentVis'),col('data.parentmeet'),col('data.vis'),col('data.out'))
df1.show(truncate = False)
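A slightly more compact equivalent, if you want every field of the nested struct, is to expand it with data.* (a sketch producing the same columns):
df1 = df.select("status", "code", "apId", "data.*")
df1.show(truncate=False)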

Extracting str from pandas dataframe using json

I read a CSV file into a dataframe named df.
Each row contains a string like the one below.
'{"id":2140043003,"name":"Olallo Rubio",...}'
I would like to extract "name" and "id" from each row and make a new dataframe to store them.
I use the following code to extract them, but it raises the error below. Please let me know if there are any suggestions on how to solve this problem. Thanks.
JSONDecodeError: Expecting ',' delimiter: line 1 column 32 (char 31)
text={
"id": 2140043003,
"name": "Olallo Rubio",
"is_registered": True,
"chosen_currency": 'Null',
"avatar": {
"thumb": "https://ksr-ugc.imgix.net/assets/019/223/259/16513215a3869caaea2d35d43f3c0c5f_original.jpg?w=40&h=40&fit=crop&v=1510685152&auto=format&q=92&s=653706657ccc49f68a27445ea37ad39a",
"small": "https://ksr-ugc.imgix.net/assets/019/223/259/16513215a3869caaea2d35d43f3c0c5f_original.jpg?w=160&h=160&fit=crop&v=1510685152&auto=format&q=92&s=0bd2f3cec5f12553e679153ba2b5d7fa",
"medium": "https://ksr-ugc.imgix.net/assets/019/223/259/16513215a3869caaea2d35d43f3c0c5f_original.jpg?w=160&h=160&fit=crop&v=1510685152&auto=format&q=92&s=0bd2f3cec5f12553e679153ba2b5d7fa"
},
"urls": {
"web": {
"user": "https://www.kickstarter.com/profile/2140043003"
},
"api": {
"user": "https://api.kickstarter.com/v1/users/2140043003?signature=1531480520.09df9a36f649d71a3a81eb14684ad0d3afc83e03"
}
}
}
def extract(text, *args):
    list1 = []
    for i in args:
        list1.append(text[i])
    return list1

print(extract(text, 'name', 'id'))
# ['Olallo Rubio', 2140043003]
Here's what I came up with using pandas.json_normalize():
import pandas as pd
sample = [{
"id": 2140043003,
"name":"Olallo Rubio",
"is_registered": True,
"chosen_currency": None,
"avatar":{
"thumb":"https://ksr-ugc.imgix.net/assets/019/223/259/16513215a3869caaea2d35d43f3c0c5f_original.jpg?w=40&h=40&fit=crop&v=1510685152&auto=format&q=92&s=653706657ccc49f68a27445ea37ad39a",
"small":"https://ksr-ugc.imgix.net/assets/019/223/259/16513215a3869caaea2d35d43f3c0c5f_original.jpg?w=160&h=160&fit=crop&v=1510685152&auto=format&q=92&s=0bd2f3cec5f12553e679153ba2b5d7fa",
"medium":"https://ksr-ugc.imgix.net/assets/019/223/259/16513215a3869caaea2d35d43f3c0c5f_original.jpg?w=160&h=160&fit=crop&v=1510685152&auto=format&q=92&s=0bd2f3cec5f12553e679153ba2b5d7fa"
},
"urls":{
"web":{
"user":"https://www.kickstarter.com/profile/2140043003"
},
"api":{
"user":"https://api.kickstarter.com/v1/users/2140043003?signature=1531480520.09df9a36f649d71a3a81eb14684ad0d3afc83e03"
}
}
}]
# Create dataframe
df = pd.json_normalize(sample)
# Select columns into new dataframe.
df1 = df.loc[:, ["name", "id",]]
Check df1:
Input:
print(df1)
Output:
name id
0 Olallo Rubio 2140043003
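If the stored strings look like the sample in the question (Python literals such as True and single-quoted values rather than strict JSON), json.loads will keep raising JSONDecodeError; a minimal sketch that parses them with ast.literal_eval instead (the column name raw is hypothetical, since the actual column name isn't given):
import ast
import pandas as pd

parsed = df["raw"].apply(ast.literal_eval)              # dict per row
df1 = pd.json_normalize(parsed.tolist())[["name", "id"]]
print(df1)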

how to extract specific data from json and put in to csv using python

I have a JSON which is in nested form. I would like to extract specific data from the JSON and put it into a CSV using pandas in Python.
data = {
"class":"hudson.model.Hudson",
"jobs":[
{
"_class":"hudson.model.FreeStyleProject",
"name":"git_checkout",
"url":"http://localhost:8080/job/git_checkout/",
"builds":[
{
"_class":"hudson.model.FreeStyleBuild",
"duration":1201,
"number":6,
"result":"FAILURE",
"url":"http://localhost:8080/job/git_checkout/6/"
}
]
},
{
"_class":"hudson.model.FreeStyleProject",
"name":"output",
"url":"http://localhost:8080/job/output/",
"builds":[
]
},
{
"_class":"org.jenkinsci.plugins.workflow.job.WorkflowJob",
"name":"pipeline_test",
"url":"http://localhost:8080/job/pipeline_test/",
"builds":[
{
"_class":"org.jenkinsci.plugins.workflow.job.WorkflowRun",
"duration":9274,
"number":85,
"result":"SUCCESS",
"url":"http://localhost:8080/job/pipeline_test/85/"
},
{
"_class":"org.jenkinsci.plugins.workflow.job.WorkflowRun",
"duration":4251,
"number":84,
"result":"SUCCESS",
"url":"http://localhost:8080/job/pipeline_test/84/"
}
]
}
]
}
From the above JSON I want to fetch the job name value and each build's result value. I am new to Python; any help will be appreciated.
So far I have tried
main_data = data['jobs']
json_normalize(main_data, ['builds'],
               record_prefix='jobs_', errors='ignore')
which gives only the build key values and not the name of the job.
Can anyone help?
Expected Output:
Considering only the first build's result value needs to be in the CSV column, you can achieve this using pandas.
data = {
"class": "hudson.model.Hudson",
"jobs": [
{
"_class": "hudson.model.FreeStyleProject",
"name": "git_checkout",
"url": "http://localhost:8080/job/git_checkout/",
"builds": [
{
"_class": "hudson.model.FreeStyleBuild",
"duration": 1201,
"number": 6,
"result": "FAILURE",
"url": "http://localhost:8080/job/git_checkout/6/"
}
]
},
{
"_class": "hudson.model.FreeStyleProject",
"name": "output",
"url": "http://localhost:8080/job/output/",
"builds": []
},
{
"_class": "org.jenkinsci.plugins.workflow.job.WorkflowJob",
"name": "pipeline_test",
"url": "http://localhost:8080/job/pipeline_test/",
"builds": [
{
"_class": "org.jenkinsci.plugins.workflow.job.WorkflowRun",
"duration": 9274,
"number": 85,
"result": "SUCCESS",
"url": "http://localhost:8080/job/pipeline_test/85/"
},
{
"_class": "org.jenkinsci.plugins.workflow.job.WorkflowRun",
"duration": 4251,
"number": 84,
"result": "SUCCESS",
"url": "http://localhost:8080/job/pipeline_test/84/"
}
]
}
]
}
main_data = data.get('jobs')
res = {'name': [], 'result': []}
for name_dict in main_data:
    res['name'].append(name_dict.get('name', 'NA'))
    resultval = name_dict['builds'][0].get('result') if len(name_dict['builds']) > 0 else 'NA'
    res['result'].append(resultval)
print(res)

import pandas as pd
df = pd.DataFrame(res)
df.to_csv("/home/file_timer/jobs.csv", index=False)
Check the csv file output
name,result
git_checkout,FAILURE
output,NA
pipeline_test,SUCCESS
If you want to skip the 'NA' results, then:
main_data = data.get('jobs')
res = {'name': [], 'result': []}
for name_dict in main_data:
    if len(name_dict['builds']) == 0:
        continue
    res['name'].append(name_dict.get('name', 'NA'))
    resultval = name_dict['builds'][0].get('result')
    res['result'].append(resultval)
print(res)

import pandas as pd
df = pd.DataFrame(res)
df.to_csv("/home/akash.pagar/shell_learning/file_timer/jobs.csv", index=False)
Output will be like
name,result
git_checkout,FAILURE
pipeline_test,SUCCESS
Simply, with the build number:
for job in data.get('jobs'):
    for build in job.get('builds'):
        print(job.get('name'), build.get('number'), build.get('result'))
gives the result
git_checkout 6 FAILURE
pipeline_test 85 SUCCESS
pipeline_test 84 SUCCESS
If you want to get the result of the latest build, and you are sure the build numbers are always in descending order:
for job in data.get('jobs'):
    if job.get('builds'):
        print(job.get('name'), job.get('builds')[0].get('result'))
and if you are not sure of the order:
for job in data.get('jobs'):
    if job.get('builds'):
        print(job.get('name'), sorted(job.get('builds'), key=lambda k: k.get('number'))[-1].get('result'))
then the result will be:
git_checkout FAILURE
pipeline_test SUCCESS
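Since the question originally tried pandas.json_normalize, here is a hedged sketch of that route as well, passing the job name through as metadata (jobs with no builds simply produce no rows; the output filename is just an example):
import pandas as pd

df = pd.json_normalize(data['jobs'], record_path='builds', meta=['name'],
                       record_prefix='build_', errors='ignore')
df = df[['name', 'build_result']]
df.to_csv("jobs.csv", index=False)  # example filename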
Assuming the last build is the last element of its list and you don't care about jobs with no builds, this does it:
import pandas as pd
#data = ... #same format as in the question
z = [(job["name"], job["builds"][-1]["result"]) for job in data["jobs"] if len(job["builds"])]
df = pd.DataFrame(data=z, columns=["name", "result"])
#df.to_csv #TODO
Also we don't necessarily need pandas to create the csv file.
You could do:
import csv
#z = ... #see previous code block
with open("f.csv", 'w') as fp:
    csv.writer(fp).writerows([("name", "result")] + z)

How to read this JSON into a dataframe with a specific dataframe format

This is my JSON string; I want to read it into a dataframe in the following tabular format.
I have no idea what I should do after pd.DataFrame(json.loads(data)).
JSON data, edited:
{
"data":[
{
"data":{
"actual":"(0.2)",
"upper_end_of_central_tendency":"-"
},
"title":"2009"
},
{
"data":{
"actual":"2.8",
"upper_end_of_central_tendency":"-"
},
"title":"2010"
},
{
"data":{
"actual":"-",
"upper_end_of_central_tendency":"2.3"
},
"title":"longer_run"
}
],
"schedule_id":"2014-03-19"
}
That's a somewhat overly nested JSON. But if that's what you have to work with, and assuming your parsed JSON is in jdata:
datapts = jdata['data']
rownames = ['actual', 'upper_end_of_central_tendency']
colnames = [ item['title'] for item in datapts ] + ['schedule_id' ]
sched_id = jdata['schedule_id']
rows = [ [item['data'][rn] for item in datapts ] + [sched_id] for rn in rownames]
df = pd.DataFrame(rows, index=rownames, columns=colnames)
df is now:
                                 2009 2010 longer_run schedule_id
actual                          (0.2)  2.8          -  2014-03-19
upper_end_of_central_tendency       -    -        2.3  2014-03-19
If you wanted to simplify that a bit, you could construct the core data without the asymmetric schedule_id field, then add that after the fact:
datapts = jdata['data']
rownames = ['actual', 'upper_end_of_central_tendency']
colnames = [ item['title'] for item in datapts ]
rows = [ [item['data'][rn] for item in datapts ] for rn in rownames]
d2 = pd.DataFrame(rows, index=rownames, columns=colnames)
d2['schedule_id'] = jdata['schedule_id']
That will make an identical DataFrame (i.e. df == d2). It helps when learning pandas to try a few different construction strategies, and get a feel for what is more straightforward. There are more powerful tools for unfolding nested structures into flatter tables, but they're not as easy to understand first time out of the gate.
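For comparison, a minimal sketch of one of those more powerful tools, pandas.json_normalize, applied to the same jdata (column order may differ):
import pandas as pd

flat = pd.json_normalize(jdata['data'])                     # one row per title
flat = flat.set_index('title').T                            # titles become columns
flat.index = [i.replace('data.', '') for i in flat.index]   # drop the 'data.' prefix
flat['schedule_id'] = jdata['schedule_id']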
(Update) If you wanted a better structuring on your JSON to make it easier to put into this format, ask pandas what it likes. E.g. df.to_json() output, slightly prettified:
{
"2009": {
"actual": "(0.2)",
"upper_end_of_central_tendency": "-"
},
"2010": {
"actual": "2.8",
"upper_end_of_central_tendency": "-"
},
"longer_run": {
"actual": "-",
"upper_end_of_central_tendency": "2.3"
},
"schedule_id": {
"actual": "2014-03-19",
"upper_end_of_central_tendency": "2014-03-19"
}
}
That is a format from which pandas' read_json function will immediately construct the DataFrame you desire.
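For example (assuming the reshaped JSON above is saved to a file; reshaped.json is just an illustrative name):
df = pd.read_json("reshaped.json")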
