How to parse nested JSON (blob format) using PySpark - python

I'm receiving records in blob format, separated by newlines. Below is an example of two events separated by a newline.
A few things to note:
In the example below, the event structure is inconsistent. For certain events I will get Channel ID, Conversation ID, replyActivity ID, From ID, and locale columns, and for absent columns I need to populate null in my data frame.
How can I achieve this in PySpark?
Example:
{
"event":[
{
"name":"Zip/Postal Code",
"count":1
}
],
"internal":{
"data":{
"id":"XXXX",
"documentVersion":"1.61"
}
},
"context":{
"application":{
"version":"Thu 10/15/2020 2:46:54.65 \r\nUTC (fv-az464-530) [Build 174613] [Repo Intercom] [Branch prod] [Commit XXXX] \r\n[IntercomWebUIVersion 1.6.20-169031] [IntercomBotAppTemplatesVersion 1.3.27-165664] \r\n"
},
"data":{
"eventTime":"2020-10-20T15:54:48.7734934Z",
"isSynthetic":false,
"samplingRate":100.0
},
"cloud":{
},
"device":{
"type":"PC",
"roleName":"bc-directline-eus2",
"roleInstance":"RD0004FFA145F5",
"screenResolution":{
}
},
"session":{
"isFirst":false
},
"operation":{
"id":"f115c4bf-4fa31385d9a8f248",
"parentId":"|f115c4bf-4fa31385d9a8f248."
},
"location":{
"clientip":"0.0.0.0",
"continent":"North America",
"country":"United States",
"province":"Virginia",
"city":"Boydton"
},
"custom":{
"dimensions":[
{
"Timestamp":"XXXX"
},
{
"StatusCode":"200"
},
{
"Activity ID":"HR48uEYXuCE1yIsFMLL3X3-j|0000006"
},
{
"From ID":"XXXX"
},
{
"Correlation ID":"|f115c4bf-4fa31385d9a8f248."
},
{
"Channel ID":"directline"
},
{
"Recipient ID":"7222C-RG-CAR-MP5-HVC-Chatbot-P-p7rpums#Ye6TP1LJz0o"
},
{
"Bot ID":"XXXX"
},
{
"Activity Type":"message"
},
{
"Conversation ID":"HR48uEYXuCE1yIsFMLL3X3-j"
}
]
}
}
}{
"event":[
{
"name":"Activity",
"count":1
}
],
"internal":{
"data":{
"id":"992b0fc7-12ec-11eb-b59a-fb2df7d234d8",
"documentVersion":"1.61"
}
},
"context":{
"application":{
"version":"Thu 10/15/2020 2:46:54.65 \r\nUTC (fv-az464-530) [Build 174613] [Repo Intercom] [Branch prod] [Commit XXXX] \r\n[IntercomWebUIVersion 1.6.20-169031] [IntercomBotAppTemplatesVersion 1.3.27-165664] \r\n"
},
"data":{
"eventTime":"2020-10-20T15:54:34.3811795Z",
"isSynthetic":false,
"samplingRate":100.0
},
"cloud":{
},
"device":{
"type":"PC",
"roleName":"bc-directline-eastus3",
"roleInstance":"RD00155D33F838",
"screenResolution":{
}
},
"session":{
"isFirst":false
},
"operation":{
"id":"00-508c4cceaa6d954599230123d012265b-5f1d891b61135340-00",
"parentId":"|00-508c4cceaa6d954599230123d012265b-5f1d891b61135340-00.2fac18fc_"
},
"location":{
"clientip":"0.0.0.0",
"continent":"North America",
"country":"United States",
"province":"Virginia",
"city":"Washington"
},
"custom":{
"dimensions":[
{
"Timestamp":"XXXX"
},
{
"StatusCode":"200"
},
{
"Activity ID":"HR48uEYXuCE1yIsFMLL3X3-j|0000000"
},
{
"From ID":"XXXX"
},
{
"Correlation ID":"|00-508c4cceaa6d954599230123d012265b-5f1d891b61135340-00.2fac18fc_"
},
{
"Channel ID":"directline"
},
{
"Bot ID":"7222C-RG-CAR-MP5-HVC-Chatbot-P-p7rpums"
},
{
"Activity Type":"message"
},
{
"Conversation ID":"HR48uEYXuCE1yIsFMLL3X3-j"
}
]
}
}
}
I need to extract these records into the following table format (column names listed below):
ActivityId | ActivityType | ChannelId | conversationId | replyActivityId | fromId | locale | recipientId | speak | text | name |eventTime | Date | InstanceId | DialogId | StepName | applicationId | intent | intentScore | entities | question | sentimentLabel | sentimentScore | knowledgeBaseId | answer | articleFound | originalQuestion| question | questionId | score | username | city | province | country | Feedback | Comment | Tag

I took your sample data and created a JSON file. I read it in with Spark using this code:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.json('/tmp/data.json')
df.show()
and it gave me:
+--------------------+--------------------+--------------------+
| context| event| internal|
+--------------------+--------------------+--------------------+
|{{Thu 10/15/2020 ...|[{1, Zip/Postal C...| {{1.61, XXXX}}|
|{{Thu 10/15/2020 ...| [{1, Activity}]|{{1.61, 992b0fc7-...|
+--------------------+--------------------+--------------------+
The problem with this format was that I was losing metadata, so I changed the approach: load the JSON as a string column and parse it later. You can do this by using:
df = spark.read.text('/tmp/data.json')
df.show()
which gives:
+--------------------+
| value|
+--------------------+
|{"event": [{"name...|
|{"event": [{"name...|
+--------------------+
From here, we can use a pandas UDF (or normal UDF) to process it. I will use the Fugue library as a way to easily convert Python and Pandas code to a Pandas UDF, but you can just turn the logic into a Pandas UDF later if you don't want to use Fugue.
Your final schema is very long so I think the concept will be clearer if I just use the first 3 columns. In this snippet I will extract:
ActivityId | ActivityType | ChannelId
In order to prototype, I will convert the original DataFrame to Pandas:
pdf = df.toPandas()
And then I will write a function that holds the logic. Some of this code is repetitive and could be simplified with helper functions, but it should be enough to illustrate the approach. One of the frustrating pieces is that some fields are tedious to pull out: the dimensions are a list of dicts that is a bit awkward to access, but you can still get it to work.
import json
from typing import List, Dict, Any, Iterable

def process(df: List[Dict[str, Any]]) -> Iterable[Dict[str, Any]]:
    for row in df:
        record = json.loads(row["value"])

        # Activity Id
        activity_id = record.get('context', {}).get('custom', {}).get('dimensions', [{}])
        activity_id = [x for x in activity_id if "Activity ID" in x.keys()]
        if len(activity_id) == 1:
            activity_id = activity_id[0]['Activity ID']
        else:
            activity_id = None

        # Activity Type
        activity_type = record.get('context', {}).get('custom', {}).get('dimensions', [{}])
        activity_type = [x for x in activity_type if "Activity Type" in x.keys()]
        if len(activity_type) == 1:
            activity_type = activity_type[0]['Activity Type']
        else:
            activity_type = None

        # Channel Id
        channel_id = record.get('context', {}).get('custom', {}).get('dimensions', [{}])
        channel_id = [x for x in channel_id if "Channel ID" in x.keys()]
        if len(channel_id) == 1:
            channel_id = channel_id[0]['Channel ID']
        else:
            channel_id = None

        yield {"ActivityId": activity_id,
               "ActivityType": activity_type,
               "ChannelId": channel_id,
               }
This function just parses each row's JSON string and then extracts the relevant fields. You might notice that the input and output types are not Pandas DataFrames. This is okay because Fugue can handle the conversion for us. In order to test this function, we can do:
import fugue.api as fa
schema = "ActivityId:str, ActivityType:str, ChannelId:str"
out = fa.transform(pdf, process, schema=schema)
# output is Pandas
out.head()
and this will adapt the process function to run on Pandas DataFrames. Schema is a requirement for Spark, so Fugue requires it as well. This gives us the following result:
ActivityId                        ActivityType ChannelId
HR48uEYXuCE1yIsFMLL3X3-j|0000006  message      directline
HR48uEYXuCE1yIsFMLL3X3-j|0000000  message      directline
Now that we know it works on Pandas, we can bring it to Spark with the exact same command. We just need to pass in the Spark DataFrame instead.
out = fa.transform(df, process, schema=schema)
# returns a Spark DataFrame
out.show()
Under the hood, Fugue will convert each partition to a List[Dict[str,Any]] and then apply the process function. In this case, it is just applied on the default partitions of your DataFrame. The output annotation Iterable[Dict[str,Any]] tells Fugue how to bring the result back out to a Spark DataFrame.
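If you would rather not use Fugue, a rough equivalent is Spark's mapInPandas (Spark 3.0+), which also processes one partition at a time. This is only a sketch, not tested against your full data, and the dim_value helper below is hypothetical; it just factors out the repeated dimension lookup mentioned earlier:
import json
import pandas as pd

def dim_value(record, key):
    # Pull a single value out of the list of single-key dicts under custom.dimensions
    dims = record.get("context", {}).get("custom", {}).get("dimensions", [])
    matches = [d[key] for d in dims if key in d]
    return matches[0] if len(matches) == 1 else None

def process_pandas(batches):
    # mapInPandas passes an iterator of pandas DataFrames, one per Arrow batch
    for pdf in batches:
        records = [json.loads(v) for v in pdf["value"]]
        yield pd.DataFrame({
            "ActivityId":   [dim_value(r, "Activity ID") for r in records],
            "ActivityType": [dim_value(r, "Activity Type") for r in records],
            "ChannelId":    [dim_value(r, "Channel ID") for r in records],
        })

out = df.mapInPandas(process_pandas,
                     schema="ActivityId string, ActivityType string, ChannelId string")
out.show(truncate=False)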

Related

Creating a table from an API response using PySpark

The response to one of my API calls is as below:
{
"requestId": "W2840866301623983629",
"items": [
{
"status": "SUCCESS",
"code": 0,
"apId": "amor:h973hw839sjw8933",
"data": {
"parentVis": 4836,
"parentmeet": 12,
"vis": 908921,
"out": 209481
}
},
{
"status": "SUCCESS",
"code": 0,
"apId": "complex:3d180779a7ea2b05f9a3c5c8",
"data": {
"parentVis": 5073,
"parentmeet": 9,
"vis": 623021,
"out": 168209
}
}
]
}
I'm trying to create a table as below:
+-------+----+--------------------------------+---------+----------+------+------+
|status |code|apId                            |parentVis|parentmeet|vis   |out   |
+-------+----+--------------------------------+---------+----------+------+------+
|SUCCESS|0   |amor:h973hw839sjw8933           |4836     |12        |908921|209481|
|SUCCESS|0   |complex:3d180779a7ea2b05f9a3c5c8|5073     |9         |623021|168209|
+-------+----+--------------------------------+---------+----------+------+------+
I tried to store the API response as a string and use sc.parallelize, but I was unable to achieve the result.
Can someone please help me with the best approach?
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(data['items'])
df.show()
That should do it (assuming your JSON sits in a data dict).
If your response is a string...
s = """{"requestId":"W2840866301623983629","items":[{"status":"SUCCESS","code":0,"apId":"amor:h973hw839sjw8933","data":{"parentVis":4836,"parentmeet":12,"vis":908921,"out":209481}},{"status":"SUCCESS","code":0,"apId":"complex:3d180779a7ea2b05f9a3c5c8","data":{"parentVis":5073,"parentmeet":9,"vis":623021,"out":168209}}]}"""
you can use the json library to extract the outer layer and then select what's inside to extract that too.
from pyspark.sql import functions as F
import json
data_cols = ['parentVis', 'parentmeet', 'vis', 'out']
df = spark.createDataFrame(json.loads(s)['items']).select(
'status', 'code', 'apId',
*[F.col("data")[c].alias(c) for c in data_cols]
)
df.show(truncate=0)
# +-------+----+--------------------------------+---------+----------+------+------+
# |status |code|apId |parentVis|parentmeet|vis |out |
# +-------+----+--------------------------------+---------+----------+------+------+
# |SUCCESS|0 |amor:h973hw839sjw8933 |4836 |12 |908921|209481|
# |SUCCESS|0 |complex:3d180779a7ea2b05f9a3c5c8|5073 |9 |623021|168209|
# +-------+----+--------------------------------+---------+----------+------+------+
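Since the question mentions trying sc.parallelize on the raw string, a variant of the same idea (a sketch, assuming s is the string above) is to let Spark parse the JSON itself and then explode the items array:
from pyspark.sql import functions as F

raw_df = spark.read.json(spark.sparkContext.parallelize([s]))
items_df = (raw_df
            .select(F.explode('items').alias('item'))
            .select('item.status', 'item.code', 'item.apId', 'item.data.*'))
items_df.show(truncate=False)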
Assuming your API response is a dictionary, this should suffice.
Store the API response in a dictionary:
dictionary = {
"requestId": "W2840866301623983629",
"items": [
{
"status": "SUCCESS",
"code": 0,
"apId": "amor:h973hw839sjw8933",
"data": {
"parentVis": 4836,
"parentmeet": 12,
"vis": 908921,
"out": 209481
}
},
{
"status": "SUCCESS",
"code": 0,
"apId": "complex:3d180779a7ea2b05f9a3c5c8",
"data": {
"parentVis": 5073,
"parentmeet": 9,
"vis": 623021,
"out": 168209
}
}
]
}
Create the schema based on the input API response, providing the data types:
from pyspark.sql.types import *

schema = StructType([
    StructField("status", StringType(), True),
    StructField("code", IntegerType(), True),  # code is numeric in the response
    StructField("apId", StringType(), True),
    StructField("data", StructType([
        StructField("parentVis", IntegerType(), True),
        StructField("parentmeet", IntegerType(), True),
        StructField("vis", IntegerType(), True),
        StructField("out", IntegerType(), True)
    ]), True),
])
Create a dataframe from the dictionary using the schema created above:
df = spark.createDataFrame(dictionary['items'], schema=schema)
Then create a new dataframe out of it by selecting the nested columns as follows:
from pyspark.sql.functions import col

df1 = df.select(
    col('status'), col('code'), col('apId'),
    col('data.parentVis'), col('data.parentmeet'), col('data.vis'), col('data.out')
)
df1.show(truncate=False)
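As a side note, since data is a struct column, the same flattening can be written more compactly (a minimal sketch, assuming the schema above):
df2 = df.select('status', 'code', 'apId', 'data.*')
df2.show(truncate=False)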

Python API JSON Output Flatten - List within Response

I'm trying to completely flatten my JSON response from an API into a pandas dataframe and have no idea how to flatten out a list of objects within the response. This relates to the "Lines" column located in the documentation here and below.
"Lines" : [
{
"Account" : {
"UID" : "17960eb4-3e14-4805-aae2-5b2387da1153",
"Name" : "Trade Debtors",
"DisplayID" : "1-1310",
"URI" : "{cf_uri}/GeneralLedger/Account/17960eb4-3e14-4805-aae2-5b2387da1153"
},
"Amount" : 100,
"IsCredit" : false,
"Job" : null,
"LineDescription" : ""
"ReconciledDate" : null,
"UnitCount": null
},
{
"Account" : {
"UID" : "f7d18c92-ada8-428e-b02a-9223022f84b2",
"Name" : "Late Fees Collected",
"DisplayID" : "4-3000",
"URI" : "{cf_uri}/GeneralLedger/Account/f7d18c92-ada8-428e-b02a-9223022f84b2"
},
"Amount" : 90.91,
"IsCredit" : true,
"Job" : null,
"LineDescription" : "Line 1 testing",
"UnitCount": null
},
{
"Account" : {
"UID" : "5427d47c-499a-4386-ad67-72de39520a00",
"Name" : "GST Collected",
"DisplayID" : "2-1210",
"URI" : "{cf_uri}/GeneralLedger/Account/5427d47c-499a-4386-ad67-72de39520a00"
},
"Amount" : 9.09,
"IsCredit" : true,
"Job" : null,
"LineDescription" : "",
"ReconciledDate" : null,
"UnitCount": null
}
],
My Code:
import pandas as pd
import requests

payload = {}
headers = {
    'x-myobapi-key': client_id,
    'x-myobapi-version': 'v2',
    'Accept-Encoding': 'gzip,deflate',
    'Authorization': f'Bearer {access_token}'
}

response = requests.request("GET", url, headers=headers, data=payload)
result = response.json()
df = pd.json_normalize(result, 'Items')

while result['NextPageLink'] is not None:
    response = requests.request("GET", result['NextPageLink'], headers=headers, data=payload)
    result = response.json()
    df1 = pd.json_normalize(result, 'Items')
    df = df.append(df1)
The code above appends each page of results until there is no next-page link. As you can see in the output below, it was able to expand the SourceTransaction columns but not the Lines column, which appears to be in list format.
In order to access the lines I need to use something like result["Items"][0]["Lines"], but that only gets the first element.
+------+-----------+-------------+--------------+------------+-------------+--------------------------------+------------------------+------------+-----------------------+-----------------------------------+-----------------------+
| UID | DisplayID | JournalType | DateOccurred | DatePosted | Description | Lines | URI | RowVersion | SourceTransaction.UID | SourceTransaction.TransactionType | SourceTransaction.URI |
+------+-----------+-------------+--------------+------------+-------------+--------------------------------+------------------------+------------+-----------------------+-----------------------------------+-----------------------+
| a100 | PJ001 | Purchase | 2022-01-01 | 2022-01-01 | Transaction | [{'Account': {'UID': '73971... | https://arl1.api.my... | -139 | e06f592c-23b | Bill | https://arl1.api... |
+------+-----------+-------------+--------------+------------+-------------+--------------------------------+------------------------+------------+-----------------------+-----------------------------------+-----------------------+
For others that stumble across the same problem, it turns out reading the documentation helped me - who knew?!
I just needed to make a few tweaks to the json_normalize call. Also, the order of the meta parameters matters, so you'll need to ensure your order matches the API documentation.
df = pd.json_normalize(
    data=result["Items"],
    record_path='Lines',
    meta=['UID', 'DisplayID', 'JournalType',
          ['SourceTransaction', 'UID'], ['SourceTransaction', 'TransactionType'], ['SourceTransaction', 'URI'],
          'DateOccurred', 'DatePosted', 'Description', 'URI', 'RowVersion']
)
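For reference, here is a minimal, self-contained sketch of the same record_path/meta pattern, using a trimmed-down, hypothetical version of the Items payload above:
import pandas as pd

# Hypothetical, trimmed-down "Items" payload mirroring the structure above
items = [
    {
        "UID": "a100",
        "DisplayID": "PJ001",
        "JournalType": "Purchase",
        "Lines": [
            {"Amount": 100, "IsCredit": False, "Account": {"Name": "Trade Debtors"}},
            {"Amount": 90.91, "IsCredit": True, "Account": {"Name": "Late Fees Collected"}},
        ],
    }
]

lines_df = pd.json_normalize(data=items, record_path="Lines",
                             meta=["UID", "DisplayID", "JournalType"])
print(lines_df)
# Each journal line becomes its own row, nested Account fields are flattened into
# dotted columns such as Account.Name, and the meta columns repeat per line.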

How to transpose JSON structs and arrays in PySpark

I have the following JSON file that I'm reading into a dataframe.
{
"details": {
"box": [
{
"Touchdowns": "123",
"field": "Texans"
},
{
"Touchdowns": "456",
"field": "Ravens"
}
]
},
"name": "Team"
}
How could I manipulate this to get the following output?
Team   | Touchdowns
Texans | 123
Ravens | 456
I'm struggling a bit with whether I need to pivot/transpose the data or if there is a more elegant approach.
Read the multiline JSON into Spark:
df = spark.read.json('/path/to/scores.json',multiLine=True)
Schema
df:pyspark.sql.dataframe.DataFrame
  details:struct
    box:array
      element:struct
        Touchdowns:string
        field:string
  name:string
All of the info you want is in the first row, so get that and drill down to details and box and make that your new dataframe.
spark.createDataFrame(df.first()['details']['box']).withColumnRenamed('field','Team').show()
Output
+----------+------+
|Touchdowns| Team|
+----------+------+
| 123|Texans|
| 456|Ravens|
+----------+------+
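If the file contained more than one top-level record, df.first() would drop the rest. A sketch of an explode-based variant that keeps every row (assuming the same schema as above) would be:
from pyspark.sql import functions as F

df2 = (df
       .select(F.explode('details.box').alias('box'))
       .select(F.col('box.field').alias('Team'), F.col('box.Touchdowns').alias('Touchdowns')))
df2.show()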
You can use the inline function.
df = spark.read.load(json_file_path, format='json', multiLine=True)
df = df.selectExpr('inline(details.box)').withColumnRenamed('field', 'Team')
df.show(truncate=False)
You can try using an RDD to get the values of the box list.
Input JSON
jsonstr="""{
"details": {
"box": [
{
"Touchdowns": "123",
"field": "Texans"
},
{
"Touchdowns": "456",
"field": "Ravens"
}
]
},
"name": "Team"
}"""
Now convert it to an RDD using the keys of the dictionary as below -
import json
box_rdd = sc.parallelize(json.loads(jsonstr)['details']['box'])
box_rdd.collect()
Output - [{'Touchdowns': '123', 'field': 'Texans'},
{'Touchdowns': '456', 'field': 'Ravens'}]
Finally create the dataframe with this box_rdd as below -
from pyspark.sql.types import *
schema = StructType([StructField('Touchdowns', StringType(), True), StructField('field', StringType(), True)])
df = spark.createDataFrame(data=box_rdd,schema=schema)
df.show()
+----------+------+
|Touchdowns| field|
+----------+------+
| 123|Texans|
| 456|Ravens|
+----------+------+

Complex date range extraction from json using Python

I want to extract all the dates after 2019-10-21 up to today from the JSON response below using Python. I'm very new to Python and just beginning to explore new functions. Can anyone give me a hint to start with?
my api response
response=
{ "id": "100",
"location":
{
"address1" : {"city":"x", "state":"y", "zip":"55"},
"address2" : {"city":"g", "state":"h", "zip":"33"},
},
"date": [
{
"shipping_date": "2020-12-13",
"shipping_name": "xuv"
},
{
"shipping_date": "2014-11-31",
"shipping_name": "yuv"
},
{
"shipping_date": "2020-12-14",
"shipping_name": "puv"
},
{
"shipping_date": "2020-08-22",
"shipping_name": "juv"
},
{
"shipping_date": "2019-10-21",
"shipping_name": "auv"
} ]
}
my output
id | shipping_date | shipping_name
100| 2020-12-13 | xuv
100| 2020-12-14 | puv
100| 2020-08-22 | juv
for data in response["date"]:
    print(data["shipping_date"])
Use this code to get all the shipping dates.
Building upon Sagun Devkota's answer, since the OP asked for data within a specific date range.
You can use time.strptime to parse and compare dates.
Suppose you need everything from 2019-1-1 to 2020-12-31; you can do:
import time

start = time.strptime("2019-1-1", "%Y-%m-%d")
end = time.strptime("2020-12-31", "%Y-%m-%d")

for data in response["date"]:
    date = time.strptime(data["shipping_date"], "%Y-%m-%d")
    if date >= start and date <= end:
        print(data["shipping_date"])
PS: November has 30 days while the shipping_date at index 1 is 11-31.
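Putting it together for the output requested in the question, here is a sketch assuming response is the dict shown above; the invalid 2014-11-31 entry is skipped because strptime raises ValueError for it:
from datetime import date, datetime

start = date(2019, 10, 21)
today = date.today()

print("id | shipping_date | shipping_name")
for data in response["date"]:
    try:
        shipped = datetime.strptime(data["shipping_date"], "%Y-%m-%d").date()
    except ValueError:
        # e.g. "2014-11-31" is not a real date
        continue
    if start < shipped <= today:
        print(response["id"], "|", data["shipping_date"], "|", data["shipping_name"])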

Python Cubes OLAP Framework - how to work with joins?

I'm trying to use Python's OLAP framework Cubes on a very simple database, but I am having some trouble joining tables.
My schema looks like this:
Users table
ID | name
Products table
ID | name | price
Purchases table
ID | user_id | product_id | date
And the cubes model:
{
'dimensions': [
{'name': 'user_id'},
{'name': 'product_id'},
{'name': 'date'},
],
'cubes': [
{
'name': 'purchases',
'dimensions': ['user_id', 'product_id', 'date'],
'measures': ['price'],
'mappings': {
'purchases.user_id': 'users.id',
'purchases.product_id': 'products.id',
'purchases.price': 'products.price'
},
'joins': [
{
'master': 'purchases.user_id',
'detail': 'users.id'
},
{
'master': 'purchases.product_id',
'detail': 'products.id'
}
]
}
]
}
Now I would like to display all the purchases, showing the product's name, user's name and purchase date. I can't seem to find a way to do this. The documentation is a bit scarce.
Thank you
First, let's fix the model a bit. In your schema you have more attributes per dimension, id and name, and you might end up having more details in the future. You can add them by specifying attributes as a list: "attributes": ["id", "name"]. Note also that the dimension is named after the entity, product, not after the key product_id. The key product_id is just an attribute of the product dimension, as is name or, in the future, maybe category. A dimension reflects the analyst's point of view.
For the time being, we ignore the fact that date should be a special dimension, and treat date as a single-value key, for example a year, so as not to complicate things here.
"dimensions": [
{"name": "user", "attributes": ["id", "name"]},
{"name": "product", "attributes": ["id", "name"]},
{"name": "date"}
],
Because we changed names of the dimensions, we have to change them in the cube's dimension list:
"cubes": [
{
"name": "purchases",
"dimensions": ["user", "product", "date"],
...
Your schema reflects a classic transactional schema, not a traditional data warehouse schema. In this case, you have to be explicit, as you were, and list all necessary mappings. The rule is: if the attribute belongs to the fact table (logical view), then the key is just the attribute, such as price, with no table specification. If the attribute belongs to a dimension, such as product.id, then the syntax is dimension.attribute. The value in the mappings dictionary is the physical table and physical column. See the documentation for more information about mappings. Mappings for your schema look like:
"mappings": {
"price": "products.price",
"product.id": "products.id",
"product.name": "products.name",
"user.id": "users.id",
"user.name": "users.name"
}
You would not have to write mappings if your schema was:
fact purchases
id | date | user_id | product_id | amount
dimension product
id | name | price
dimension user
id | name
In this case you would need only joins, because all dimension attributes are in their respective dimension tables. Note the amount in the fact table; in your case, since you do not have a count of purchased products per purchase, it would be the same as price in product.
Here is the updated model for your schema:
{
"dimensions": [
{"name": "user", "attributes": ["id", "name"]},
{"name": "product", "attributes": ["id", "name"]},
{"name": "date"}
],
"cubes": [
{
"name": "purchases",
"dimensions": ["user", "product", "date"],
"measures": ["price"],
"mappings": {
"price": "products.price",
"product.id": "products.id",
"product.name": "products.name",
"user.id": "users.id",
"user.name": "users.name"
},
"joins": [
{
"master": "purchases.user_id",
"detail": "users.id"
},
{
"master": "purchases.product_id",
"detail": "products.id"
}
]
}
]
}
You can try the model without writing any Python code, just by using the slicer command. For that you will need a slicer.ini configuration file:
[server]
backend: sql
port: 5000
log_level: info
prettyprint: yes
[workspace]
url: sqlite:///data.sqlite
[model]
path: model.json
Change url in [workspace] to point to your database and change path in [model] to point to your model file. Now you can try:
curl "http://localhost:5000/aggregate"
Also try to drill down:
curl "http://localhost:5000/aggregate?drilldown=product"
If you need any further help, just let me know, I'm the Cubes author.
