I'm trying to completely flatten the JSON response from an API into a pandas DataFrame, and I have no idea how to flatten a list of objects within the response. This relates to the "Lines" column, described in the documentation here and shown below.
"Lines" : [
{
"Account" : {
"UID" : "17960eb4-3e14-4805-aae2-5b2387da1153",
"Name" : "Trade Debtors",
"DisplayID" : "1-1310",
"URI" : "{cf_uri}/GeneralLedger/Account/17960eb4-3e14-4805-aae2-5b2387da1153"
},
"Amount" : 100,
"IsCredit" : false,
"Job" : null,
"LineDescription" : "",
"ReconciledDate" : null,
"UnitCount": null
},
{
"Account" : {
"UID" : "f7d18c92-ada8-428e-b02a-9223022f84b2",
"Name" : "Late Fees Collected",
"DisplayID" : "4-3000",
"URI" : "{cf_uri}/GeneralLedger/Account/f7d18c92-ada8-428e-b02a-9223022f84b2"
},
"Amount" : 90.91,
"IsCredit" : true,
"Job" : null,
"LineDescription" : "Line 1 testing",
"UnitCount": null
},
{
"Account" : {
"UID" : "5427d47c-499a-4386-ad67-72de39520a00",
"Name" : "GST Collected",
"DisplayID" : "2-1210",
"URI" : "{cf_uri}/GeneralLedger/Account/5427d47c-499a-4386-ad67-72de39520a00"
},
"Amount" : 9.09,
"IsCredit" : true,
"Job" : null,
"LineDescription" : "",
"ReconciledDate" : null,
"UnitCount": null
}
],
My Code:
import pandas as pd
import requests

# url, client_id and access_token are defined elsewhere
payload = {}
headers = {
    'x-myobapi-key': client_id,
    'x-myobapi-version': 'v2',
    'Accept-Encoding': 'gzip,deflate',
    'Authorization': f'Bearer {access_token}'
}
response = requests.request("GET", url, headers=headers, data=payload)
result = response.json()
df = pd.json_normalize(result, 'Items')
# Keep requesting pages until the API stops returning a NextPageLink
while result['NextPageLink'] is not None:
    response = requests.request("GET", result['NextPageLink'], headers=headers, data=payload)
    result = response.json()
    df1 = pd.json_normalize(result, 'Items')
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    df = pd.concat([df, df1], ignore_index=True)
The code above appends each page of results until there is no next-page link. As the output below shows, it expanded the SourceTransaction columns but not the Lines column, apparently because that column holds a list.
To access the lines I have to use result["Items"][0]["Lines"], but that only returns the lines of the first item.
+------+-----------+-------------+--------------+------------+-------------+--------------------------------+------------------------+------------+-----------------------+-----------------------------------+-----------------------+
| UID | DisplayID | JournalType | DateOccurred | DatePosted | Description | Lines | URI | RowVersion | SourceTransaction.UID | SourceTransaction.TransactionType | SourceTransaction.URI |
+------+-----------+-------------+--------------+------------+-------------+--------------------------------+------------------------+------------+-----------------------+-----------------------------------+-----------------------+
| a100 | PJ001 | Purchase | 2022-01-01 | 2022-01-01 | Transaction | [{'Account': {'UID': '73971... | https://arl1.api.my... | -139 | e06f592c-23b | Bill | https://arl1.api... |
+------+-----------+-------------+--------------+------------+-------------+--------------------------------+------------------------+------------+-----------------------+-----------------------------------+-----------------------+
For anyone else who stumbles across the same problem: it turns out that reading the documentation helped. Who knew?!
I just needed to make a few tweaks to the json_normalize call: pass the list of items as data, name the nested list in record_path, and list the parent fields to keep in meta. Also, the order in which you list the meta fields matters, so make sure it matches the API documentation.
df = pd.json_normalize(
    data=result["Items"],
    record_path='Lines',
    meta=[
        'UID', 'DisplayID', 'JournalType',
        ['SourceTransaction', 'UID'],
        ['SourceTransaction', 'TransactionType'],
        ['SourceTransaction', 'URI'],
        'DateOccurred', 'DatePosted', 'Description', 'URI', 'RowVersion',
    ],
)
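To see what record_path and meta do, here is a minimal sketch on a made-up, trimmed-down response with a single item. The sample values are assumptions for illustration, not real API output, and only a few of the meta fields are included.

```python
import pandas as pd

# Hypothetical response: one journal item with two lines.
result = {
    "Items": [
        {
            "UID": "a100",
            "DisplayID": "PJ001",
            "SourceTransaction": {"UID": "e06f592c-23b", "TransactionType": "Bill"},
            "Lines": [
                {"Account": {"Name": "Trade Debtors"}, "Amount": 100, "IsCredit": False},
                {"Account": {"Name": "GST Collected"}, "Amount": 9.09, "IsCredit": True},
            ],
        }
    ]
}

# One output row per element of "Lines"; the meta fields of the parent
# item are repeated on every row. Nested meta fields are given as paths.
df = pd.json_normalize(
    data=result["Items"],
    record_path="Lines",
    meta=[
        "UID",
        "DisplayID",
        ["SourceTransaction", "UID"],
        ["SourceTransaction", "TransactionType"],
    ],
)
print(df)
```

Nested dictionaries inside each line (here Account) are flattened into dotted column names such as Account.Name, and nested meta paths become SourceTransaction.UID and so on.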
I'm receiving records as a blob in which the events are separated by newlines. Below is an example of two events separated by a newline.
A few things to note:
The structure of the events is inconsistent. Only some events contain the Channel Id, conversation Id, replyActivity Id, from Id and locale fields; where a field is absent, I need to populate the column with null in my DataFrame.
How can I achieve this in PySpark?
Example:
{
"event":[
{
"name":"Zip/Postal Code",
"count":1
}
],
"internal":{
"data":{
"id":"XXXX",
"documentVersion":"1.61"
}
},
"context":{
"application":{
"version":"Thu 10/15/2020 2:46:54.65 \r\nUTC (fv-az464-530) [Build 174613] [Repo Intercom] [Branch prod] [Commit XXXX] \r\n[IntercomWebUIVersion 1.6.20-169031] [IntercomBotAppTemplatesVersion 1.3.27-165664] \r\n"
},
"data":{
"eventTime":"2020-10-20T15:54:48.7734934Z",
"isSynthetic":false,
"samplingRate":100.0
},
"cloud":{
},
"device":{
"type":"PC",
"roleName":"bc-directline-eus2",
"roleInstance":"RD0004FFA145F5",
"screenResolution":{
}
},
"session":{
"isFirst":false
},
"operation":{
"id":"f115c4bf-4fa31385d9a8f248",
"parentId":"|f115c4bf-4fa31385d9a8f248."
},
"location":{
"clientip":"0.0.0.0",
"continent":"North America",
"country":"United States",
"province":"Virginia",
"city":"Boydton"
},
"custom":{
"dimensions":[
{
"Timestamp":"XXXX"
},
{
"StatusCode":"200"
},
{
"Activity ID":"HR48uEYXuCE1yIsFMLL3X3-j|0000006"
},
{
"From ID":"XXXX"
},
{
"Correlation ID":"|f115c4bf-4fa31385d9a8f248."
},
{
"Channel ID":"directline"
},
{
"Recipient ID":"7222C-RG-CAR-MP5-HVC-Chatbot-P-p7rpums#Ye6TP1LJz0o"
},
{
"Bot ID":"XXXX"
},
{
"Activity Type":"message"
},
{
"Conversation ID":"HR48uEYXuCE1yIsFMLL3X3-j"
}
]
}
}
}
{
"event":[
{
"name":"Activity",
"count":1
}
],
"internal":{
"data":{
"id":"992b0fc7-12ec-11eb-b59a-fb2df7d234d8",
"documentVersion":"1.61"
}
},
"context":{
"application":{
"version":"Thu 10/15/2020 2:46:54.65 \r\nUTC (fv-az464-530) [Build 174613] [Repo Intercom] [Branch prod] [Commit XXXX] \r\n[IntercomWebUIVersion 1.6.20-169031] [IntercomBotAppTemplatesVersion 1.3.27-165664] \r\n"
},
"data":{
"eventTime":"2020-10-20T15:54:34.3811795Z",
"isSynthetic":false,
"samplingRate":100.0
},
"cloud":{
},
"device":{
"type":"PC",
"roleName":"bc-directline-eastus3",
"roleInstance":"RD00155D33F838",
"screenResolution":{
}
},
"session":{
"isFirst":false
},
"operation":{
"id":"00-508c4cceaa6d954599230123d012265b-5f1d891b61135340-00",
"parentId":"|00-508c4cceaa6d954599230123d012265b-5f1d891b61135340-00.2fac18fc_"
},
"location":{
"clientip":"0.0.0.0",
"continent":"North America",
"country":"United States",
"province":"Virginia",
"city":"Washington"
},
"custom":{
"dimensions":[
{
"Timestamp":"XXXX"
},
{
"StatusCode":"200"
},
{
"Activity ID":"HR48uEYXuCE1yIsFMLL3X3-j|0000000"
},
{
"From ID":"XXXX"
},
{
"Correlation ID":"|00-508c4cceaa6d954599230123d012265b-5f1d891b61135340-00.2fac18fc_"
},
{
"Channel ID":"directline"
},
{
"Bot ID":"7222C-RG-CAR-MP5-HVC-Chatbot-P-p7rpums"
},
{
"Activity Type":"message"
},
{
"Conversation ID":"HR48uEYXuCE1yIsFMLL3X3-j"
}
]
}
}
}
I need to extract these records into the following table format (column names listed below):
ActivityId | ActivityType | ChannelId | conversationId | replyActivityId | fromId | locale | recipientId | speak | text | name |eventTime | Date | InstanceId | DialogId | StepName | applicationId | intent | intentScore | entities | question | sentimentLabel | sentimentScore | knowledgeBaseId | answer | articleFound | originalQuestion| question | questionId | score | username | city | province | country | Feedback | Comment | Tag
I took your sample data and created a json file. I read it in with Spark using this code:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.json('/tmp/data.json')
df.show()
and it gave me:
+--------------------+--------------------+--------------------+
| context| event| internal|
+--------------------+--------------------+--------------------+
|{{Thu 10/15/2020 ...|[{1, Zip/Postal C...| {{1.61, XXXX}}|
|{{Thu 10/15/2020 ...| [{1, Activity}]|{{1.61, 992b0fc7-...|
+--------------------+--------------------+--------------------+
The problem with this format was that I was losing metadata, so I changed the approach: load each JSON record as a string column and parse it later. You can do this with:
df = spark.read.text('/tmp/data.json')
df.show()
which gives:
+--------------------+
| value|
+--------------------+
|{"event": [{"name...|
|{"event": [{"name...|
+--------------------+
From here, we can use a pandas UDF (or normal UDF) to process it. I will use the Fugue library as a way to easily convert Python and Pandas code to a Pandas UDF, but you can just turn the logic into a Pandas UDF later if you don't want to use Fugue.
Your final schema is very long so I think the concept will be clearer if I just use the first 3 columns. In this snippet I will extract:
ActivityId | ActivityType | ChannelId
In order to prototype, I will convert the original DataFrame to Pandas:
pdf = df.toPandas()
And then I will make a function that holds the logic. Some of this code may be repetitive and you might be able to simplify it with functions. I think this should be enough to illustrate the logic. One of the frustrating pieces was that it's tedious to pull some fields. There are Lists of Dicts that are a bit hard to access, but you can still get it to work.
import json
from typing import List, Dict, Any, Iterable

def process(df: List[Dict[str,Any]]) -> Iterable[Dict[str,Any]]:
    for row in df:
        # Parse the JSON string of the current row
        record = json.loads(row["value"])

        # Activity Id
        activity_id = record.get('context', {}).get('custom', {}).get('dimensions', [{}])
        activity_id = [x for x in activity_id if "Activity ID" in x.keys()]
        if len(activity_id) == 1:
            activity_id = activity_id[0]['Activity ID']
        else:
            activity_id = None

        # Activity Type
        activity_type = record.get('context', {}).get('custom', {}).get('dimensions', [{}])
        activity_type = [x for x in activity_type if "Activity Type" in x.keys()]
        if len(activity_type) == 1:
            activity_type = activity_type[0]['Activity Type']
        else:
            activity_type = None

        # Channel Id
        channel_id = record.get('context', {}).get('custom', {}).get('dimensions', [{}])
        channel_id = [x for x in channel_id if "Channel ID" in x.keys()]
        if len(channel_id) == 1:
            channel_id = channel_id[0]['Channel ID']
        else:
            channel_id = None

        yield {"ActivityId": activity_id,
               "ActivityType": activity_type,
               "ChannelId": channel_id,
               }
This function parses the JSON string in each row and then extracts the relevant fields; a field that is missing from an event comes out as None. You might notice that the input and output types are not Pandas DataFrames. This is okay because Fugue can handle the conversion for us. In order to test this function, we can do:
import fugue.api as fa
schema = "ActivityId:str, ActivityType:str, ChannelId:str"
out = fa.transform(pdf, process, schema=schema)
# output is Pandas
out.head()
and this will adapt the process function to run on Pandas DataFrames. Schema is a requirement for Spark, so Fugue requires it as well. This gives us the following result:
ActivityId ActivityType ChannelId
HR48uEYXuCE1yIsFMLL3X3-j|0000006 message directline
HR48uEYXuCE1yIsFMLL3X3-j|0000000 message directline
Now that we know it works on Pandas, we can bring it to Spark with the exact same command. We just need to pass in the Spark DataFrame instead.
out = fa.transform(df, process, schema=schema)
# returns a Spark DataFrame
out.show()
Under the hood, Fugue will convert each partition to a List[Dict[str,Any]] and then apply the process function. In this case, it is just applied on the default partitions of your DataFrame. The output annotation Iterable[Dict[str,Any]] tells Fugue how to bring the result back out to a Spark DataFrame.
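The repeated lookups in the dimensions list could be folded into a small helper. The following is only a sketch: get_dimension, COLUMNS and the two sample events are made up for illustration, and unlike the version above it returns the first match rather than None when a key appears more than once. It runs in plain Python so the logic can be checked before handing process to Fugue or a UDF.

```python
import json
from typing import Any, Dict, Iterable, List, Optional


def get_dimension(record: Dict[str, Any], key: str) -> Optional[str]:
    """Return the value stored under `key` in context.custom.dimensions, or None."""
    dimensions = record.get("context", {}).get("custom", {}).get("dimensions", [])
    for item in dimensions:
        if key in item:
            return item[key]
    return None


# Hypothetical mapping from output column name to the key used in the events.
COLUMNS = {
    "ActivityId": "Activity ID",
    "ActivityType": "Activity Type",
    "ChannelId": "Channel ID",
    "recipientId": "Recipient ID",
}


def process(df: List[Dict[str, Any]]) -> Iterable[Dict[str, Any]]:
    for row in df:
        record = json.loads(row["value"])
        # Absent keys become None, which ends up as null in the DataFrame.
        yield {column: get_dimension(record, key) for column, key in COLUMNS.items()}


# Two tiny made-up events; the second one has no "Recipient ID".
first = {"context": {"custom": {"dimensions": [
    {"Activity ID": "a|0000006"}, {"Activity Type": "message"},
    {"Channel ID": "directline"}, {"Recipient ID": "bot-1"}]}}}
second = {"context": {"custom": {"dimensions": [
    {"Activity ID": "a|0000000"}, {"Activity Type": "message"},
    {"Channel ID": "directline"}]}}}
rows = [{"value": json.dumps(first)}, {"value": json.dumps(second)}]

result = list(process(rows))
print(result)
```

If this version were passed to fa.transform, the schema string would need one entry per key of COLUMNS, for example an extra recipientId:str.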
I have a JSON file, like this:
{
"data" : [
{ "values" : [ "ColumnHeader1", "ColumnHeader2", "ColumnHeader3" ]},
{ "values" : [ "Row1Column1", "Row1Column2", "Row1Column3" ]},
{ "values" : [ "Row2Column1", "Row2Column2", "Row2Column3" ]}
]
}
I want to transform it, to be like this:
{
"data" : [
{ "ColumnHeader1" : "Row1Column1", "ColumnHeader2" : "Row1Column2", "ColumnHeader3" : "Row1Column3" },
{ "ColumnHeader1" : "Row2Column1", "ColumnHeader2" : "Row2Column2", "ColumnHeader3" : "Row2Column3" }
]
}
I did write a Python script for that - but I wonder could something clever be done via jq or pandas ? (or some other Unix tool or Python library...)
A jq-only solution:
def objectify($header):
. as $in
| reduce range(0; $header|length) as $i ({}; .[$header[$i]] = $in[$i] );
.data[0].values as $header
| .data |= (.[1:] | map(.values | objectify($header)) )
If you like nifty:
def objectify($header): with_entries(.key |= $header[.]) ;
So, if you want a two-liner:
.data[0].values as $header
| .data |= (.[1:] | map(.values | with_entries(.key |= $header[.])))
Here is a different jq solution without reduce:
.data |= (
map(.values)
| first as $headers | del(first)
| map(
[ $headers, .]
| transpose
| map({(first): last})
| add
)
)
Output:
{
"data": [
{
"ColumnHeader1": "Row1Column1",
"ColumnHeader2": "Row1Column2",
"ColumnHeader3": "Row1Column3"
},
{
"ColumnHeader1": "Row2Column1",
"ColumnHeader2": "Row2Column2",
"ColumnHeader3": "Row2Column3"
}
]
}
Or to rebuild the result object from scratch:
{
data: (
.data | map(.values)
| first as $headers | del(first)
| map(
[ $headers, .]
| transpose
| map({(first): last})
| add
)
)
}
first as $headers could be rewritten as . as [$headers] or .[0] as $headers. del(first) could be replaced with .[1:].
(not very elegant) Answer
"""
Read a JSON file where the 1st item in array is a set of headers, and the other items are values.
Outputs a JSON file where the other items in array are transposed to use those headers.
"""
import json
from sys import argv

input_json = argv[1]
output_json = argv[2]

data = None
with open(input_json, "r") as infile:
    # returns JSON object as a dictionary
    data = json.load(infile)

headings = data["data"][0]['values']
new_rows = []
rows = len(data["data"])
for r in range(1, rows):
    row = data["data"][r]['values']
    new_row = dict()
    new_rows.append(new_row)
    for h in range(0, len(headings)):
        new_row[headings[h]] = row[h]

new_data = dict()
new_data["data"] = new_rows

with open(output_json, "w") as outfile:
    json_object = json.dumps(new_data, indent=2)
    outfile.write(json_object)
Hoping there is a better way with less code :)
A one liner:
d = {
"data" : [
{ "values" : [ "ColumnHeader1", "ColumnHeader2", "ColumnHeader3" ]},
{ "values" : [ "Row1Column1", "Row1Column2", "Row1Column3" ]},
{ "values" : [ "Row2Column1", "Row2Column2", "Row2Column3" ]}
]
}
d = {"data": [{k: v for k, v in zip(d["data"][0]["values"], row["values"])} for row in d["data"][1:]]}
Outputs:
{'data': [{'ColumnHeader1': 'Row1Column1', 'ColumnHeader2': 'Row1Column2', 'ColumnHeader3': 'Row1Column3'}, {'ColumnHeader1': 'Row2Column1', 'ColumnHeader2': 'Row2Column2', 'ColumnHeader3': 'Row2Column3'}]}
You don't need to iterate all headings. I hope it helps.
data = {
"data": [
{"values": ["ColumnHeader1", "ColumnHeader2", "ColumnHeader3"]},
{"values": ["Row1Column1", "Row1Column2", "Row1Column3"]},
{"values": ["Row2Column1", "Row2Column2", "Row2Column3"]},
]
}
data = data["data"]
headings = data[0]["values"]
rows = data[1:]
new_data = {"data":[dict(zip(headings, row["values"])) for row in rows]}
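Since the question also asks about pandas, here is one possible sketch: build a DataFrame with the first row as column names and export it as a list of records. The variable names are arbitrary, and for a file this small the plain zip version is probably simpler.

```python
import pandas as pd

data = {
    "data": [
        {"values": ["ColumnHeader1", "ColumnHeader2", "ColumnHeader3"]},
        {"values": ["Row1Column1", "Row1Column2", "Row1Column3"]},
        {"values": ["Row2Column1", "Row2Column2", "Row2Column3"]},
    ]
}

# First entry holds the column names, the remaining entries hold the rows.
header = data["data"][0]["values"]
rows = [item["values"] for item in data["data"][1:]]

frame = pd.DataFrame(rows, columns=header)
# orient="records" gives one dict per row, keyed by column name.
new_data = {"data": frame.to_dict(orient="records")}
print(new_data)
```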
Here's a solution for jq using transpose and map:
.data |= (map(.values) | transpose
| map([{(.[0]): .[1:][]}]) | transpose
| map(add)
)
{
"data": [
{
"ColumnHeader1": "Row1Column1",
"ColumnHeader2": "Row1Column2",
"ColumnHeader3": "Row1Column3"
},
{
"ColumnHeader1": "Row2Column1",
"ColumnHeader2": "Row2Column2",
"ColumnHeader3": "Row2Column3"
}
]
}
I have an issue with a POST request in Python.
My code:
import requests

url = "https://blablabla.com"
data = {
    "data1": "blalba",
    "data2": "12"
}
resp = requests.post(url, json=data)
print(resp.text)
My output is :
{"result":false,"data":null,"error":{"httpStatusCode":0,"errorMessages":null,"isError":false},"validationErrors":[{"field":"$.sicno","error":["The JSON value could not be converted to System.Int32. Path: $.sicno | LineNumber: 0 | BytePositionInLine: 18."]}],"isError":true}
How can I handle this? Thank you.
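Judging only by the validation error, the server could not read the field at $.sicno as a System.Int32, which usually means a string was sent where a number was expected. The sketch below assumes the real payload has a field named sicno that is currently sent as a string; the field names and values here are hypothetical because the question replaced the real ones.

```python
import json

# Hypothetical payload: "sicno" is taken from the error path "$.sicno".
data = {"name": "blalba", "sicno": "12"}

# Assumption: the API wants a JSON number, so convert before posting.
data["sicno"] = int(data["sicno"])

# This is the body that requests.post(url, json=data) would serialise.
body = json.dumps(data)
print(body)  # {"name": "blalba", "sicno": 12}
```

With the value converted, the request would be sent exactly as before with requests.post(url, json=data).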
I have a config file in the format below:
Models{
Model1{
Description = "xxxx"
Feature = "yyyy"
EventType = [
"Type1",
"Type2"]
}
Model2{
Description = "aaaa"
Feature = "bbbb"
EventType = [
"Type3",
"Type4"]
}
}
Is there a way to transform this into a dataframe as below?
|Model | Description | Feature | EventType |
------------------------------------------------
|Model1 | xxxx | yyyy | Type1, Type2 |
|Model2 | aaaa | bbbb | Type3, Type4 |
First you should convert it into standard JSON format. You can accomplish that using regex:
import re

with open('untitled.txt') as f:
    data = f.read()

# Converting into JSON format
data = re.sub(r'(=\s*".*")\n', r'\1,\n', data)
data = re.sub(r'(Description|Feature|EventType)', r'"\1"', data)
data = re.sub(r'}(\s*Model[0-9]+)', r'},\1', data)
data = re.sub(r'(Model[0-9]+)', r'"\1"=', data)
data = re.sub(r'(Models)', r'', data)
data = re.sub(r'=', r':', data)
Your file will look like this:
{
"Model1":{
"Description" : "xxxx",
"Feature" : "yyyy",
"EventType" : [
"Type1",
"Type2"]
},
"Model2":{
"Description" : "aaaa",
"Feature" : "bbbb",
"EventType" : [
"Type3",
"Type4"]
}
}
Then, use pd.read_json to read it:
import pandas as pd
from io import StringIO
df = pd.read_json(StringIO(data), orient='index').reset_index()
# index Description EventType Feature
#0 Model1 xxxx [Type1, Type2] yyyy
#1 Model2 aaaa [Type3, Type4] bbbb
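The result above keeps EventType as a list and names the first column index. To get closer to the table in the question, the lists can be joined into a comma-separated string and the columns renamed and reordered. This is a sketch that starts from the converted JSON text shown above, pasted in as a string so that it runs on its own.

```python
import pandas as pd
from io import StringIO

# The JSON text produced by the regex conversion.
data = """
{
"Model1":{
"Description" : "xxxx",
"Feature" : "yyyy",
"EventType" : ["Type1", "Type2"]
},
"Model2":{
"Description" : "aaaa",
"Feature" : "bbbb",
"EventType" : ["Type3", "Type4"]
}
}
"""

df = pd.read_json(StringIO(data), orient="index").reset_index()
df = df.rename(columns={"index": "Model"})
# Turn each list of event types into a single comma-separated string.
df["EventType"] = df["EventType"].apply(", ".join)
# Put the columns in the order requested in the question.
df = df[["Model", "Description", "Feature", "EventType"]]
print(df)
```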
In the JSON below, I want to access the email-id and 'gamesplayed' field for each user.
"UserTable" : {
"abcd#gmailcom" : {
"gameHistory" : {
"G1" : [ {
"category" : "1",
"questiontext" : "What is the cube of 2 ?"
}, {
"category" : "2",
"questiontext" : "What is the cube of 4 ?"
} ]
},
"gamesplayed" : 2
},
"xyz#gmailcom" : {
"gameHistory" : {
"G1" : [ {
"category" : "1",
"questiontext" : "What is the cube of 2 ?"
}, {
"category" : "2",
"questiontext" : "What is the cube of 4 ?"
} ]
},
"gamesplayed" : 2
}
}
Following is the code that I using to try and access the users email-id:
for user in jp.match("$.UserTable[*].[0]", game_data):
    print("User ID's {}".format(user_id))
This is the error I'm getting:
File "C:\ProgramData\Anaconda3\lib\site-packages\jsonpath_rw\jsonpath.py", line 444, in find
return [DatumInContext(datum.value[self.index], path=self, context=datum)]
KeyError: 0
And when I run the following line to access the 'gamesplayed' field for each user, the IDE crashes.
print (parser.ExtentedJsonPathParser().parse("$.*.gamesplayed").find(gd_info))
If you would like to use JSONPath, please try this.
Python code:
import json
from jsonpath_rw import parse

# json_path is the path to the JSON file
with open(json_path) as json_file:
    raw_data = json.load(json_file)

jsonpath_expr = parse('$.UserTable')
# The expression matches a single node: the dict of all users keyed by email
players = [match.value for match in jsonpath_expr.find(raw_data)][0]
emails = players.keys()
result = [{'email': email, 'gamesplayed': players[email]['gamesplayed']} for email in emails]
print(result)
Output:
[{'email': 'abcd#gmailcom', 'gamesplayed': 2}, {'email': 'xyz#gmailcom', 'gamesplayed': 2}]
Python handles valid JSON as dictionaries, so you first have to parse the JSON string into a Python dictionary.
import json
dic = json.loads(json_str)
You can now access a value by using its key as an index: value = dic[key].
# The users are stored under the "UserTable" key, keyed by email
for user in dic["UserTable"]:
    email = user
    gamesplayed = dic["UserTable"][user]["gamesplayed"]
    print("{} played {} game(s).".format(email, gamesplayed))
Output:
abcd#gmailcom played 2 game(s).
xyz#gmailcom played 2 game(s).