I have a config file in the format below:
Models{
Model1{
Description = "xxxx"
Feature = "yyyy"
EventType = [
"Type1",
"Type2"]
}
Model2{
Description = "aaaa"
Feature = "bbbb"
EventType = [
"Type3",
"Type4"]
}
}
Is there a way to transform this into a dataframe as below?
|Model | Description | Feature | EventType |
------------------------------------------------
|Model1 | xxxx | yyyy | Type1, Type2 |
|Model2 | aaaa | bbbb | Type3, Type4 |
First you should convert it into standard JSON format. You can accomplish that using regex:
import re
with open('untitled.txt') as f:
    data = f.read()
# Converting into JSON format
data = re.sub(r'(=\s*".*")\n', r'\1,\n', data)
data = re.sub(r'(Description|Feature|EventType)', r'"\1"', data)
data = re.sub(r'}(\s*Model[0-9]+)', r'},\1', data)
data = re.sub(r'(Model[0-9]+)', r'"\1"=', data)
data = re.sub(r'(Models)', r'', data)
data = re.sub(r'=', r':', data)
Your file will look like this:
{
"Model1":{
"Description" : "xxxx",
"Feature" : "yyyy",
"EventType" : [
"Type1",
"Type2"]
},
"Model2":{
"Description" : "aaaa",
"Feature" : "bbbb",
"EventType" : [
"Type3",
"Type4"]
}
}
Then, use pd.read_json to read it:
import pandas as pd
from io import StringIO
df = pd.read_json(StringIO(data), orient='index').reset_index()
# index Description EventType Feature
#0 Model1 xxxx [Type1, Type2] yyyy
#1 Model2 aaaa [Type3, Type4] bbbb
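The desired table shows EventType as a comma-separated string rather than a list. A minimal sketch of that last step, assuming the file looks exactly like the sample in the question: it reuses the same substitutions, parses the result with `json.loads`, and joins each list. The names `raw`, `config_to_json` and `rows` are made up for this illustration.

```python
import json
import re

# Sample text copied from the question (assumed to match the real file layout).
raw = '''Models{
Model1{
Description = "xxxx"
Feature = "yyyy"
EventType = [
"Type1",
"Type2"]
}
Model2{
Description = "aaaa"
Feature = "bbbb"
EventType = [
"Type3",
"Type4"]
}
}'''


def config_to_json(text):
    """Apply the same regex substitutions as the answer and return JSON text."""
    text = re.sub(r'(=\s*".*")\n', r'\1,\n', text)
    text = re.sub(r'(Description|Feature|EventType)', r'"\1"', text)
    text = re.sub(r'}(\s*Model[0-9]+)', r'},\1', text)
    text = re.sub(r'(Model[0-9]+)', r'"\1"=', text)
    text = re.sub(r'(Models)', r'', text)
    return re.sub(r'=', r':', text)


models = json.loads(config_to_json(raw))

# One flat row per model, with the EventType list joined into a single string.
rows = [
    {
        "Model": name,
        "Description": model["Description"],
        "Feature": model["Feature"],
        "EventType": ", ".join(model["EventType"]),
    }
    for name, model in models.items()
]
```

Passing `rows` to `pd.DataFrame(rows)` should then give one row per model with the four columns from the question.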
I'm receiving the following records in blob format, separated by newlines. Below is an example of two events separated by a newline.
A few things to note:
In the example below, the event structure is inconsistent. Only certain events contain the Channel Id, conversation Id, replyActivity Id, from Id and locale columns; for absent columns I need to populate null in my data frame.
How can I achieve this in PySpark?
Example:
{
"event":[
{
"name":"Zip/Postal Code",
"count":1
}
],
"internal":{
"data":{
"id":"XXXX",
"documentVersion":"1.61"
}
},
"context":{
"application":{
"version":"Thu 10/15/2020 2:46:54.65 \r\nUTC (fv-az464-530) [Build 174613] [Repo Intercom] [Branch prod] [Commit XXXX] \r\n[IntercomWebUIVersion 1.6.20-169031] [IntercomBotAppTemplatesVersion 1.3.27-165664] \r\n"
},
"data":{
"eventTime":"2020-10-20T15:54:48.7734934Z",
"isSynthetic":false,
"samplingRate":100.0
},
"cloud":{
},
"device":{
"type":"PC",
"roleName":"bc-directline-eus2",
"roleInstance":"RD0004FFA145F5",
"screenResolution":{
}
},
"session":{
"isFirst":false
},
"operation":{
"id":"f115c4bf-4fa31385d9a8f248",
"parentId":"|f115c4bf-4fa31385d9a8f248."
},
"location":{
"clientip":"0.0.0.0",
"continent":"North America",
"country":"United States",
"province":"Virginia",
"city":"Boydton"
},
"custom":{
"dimensions":[
{
"Timestamp":"XXXX"
},
{
"StatusCode":"200"
},
{
"Activity ID":"HR48uEYXuCE1yIsFMLL3X3-j|0000006"
},
{
"From ID":"XXXX"
},
{
"Correlation ID":"|f115c4bf-4fa31385d9a8f248."
},
{
"Channel ID":"directline"
},
{
"Recipient ID":"7222C-RG-CAR-MP5-HVC-Chatbot-P-p7rpums#Ye6TP1LJz0o"
},
{
"Bot ID":"XXXX"
},
{
"Activity Type":"message"
},
{
"Conversation ID":"HR48uEYXuCE1yIsFMLL3X3-j"
}
]
}
}
}{
"event":[
{
"name":"Activity",
"count":1
}
],
"internal":{
"data":{
"id":"992b0fc7-12ec-11eb-b59a-fb2df7d234d8",
"documentVersion":"1.61"
}
},
"context":{
"application":{
"version":"Thu 10/15/2020 2:46:54.65 \r\nUTC (fv-az464-530) [Build 174613] [Repo Intercom] [Branch prod] [Commit XXXX] \r\n[IntercomWebUIVersion 1.6.20-169031] [IntercomBotAppTemplatesVersion 1.3.27-165664] \r\n"
},
"data":{
"eventTime":"2020-10-20T15:54:34.3811795Z",
"isSynthetic":false,
"samplingRate":100.0
},
"cloud":{
},
"device":{
"type":"PC",
"roleName":"bc-directline-eastus3",
"roleInstance":"RD00155D33F838",
"screenResolution":{
}
},
"session":{
"isFirst":false
},
"operation":{
"id":"00-508c4cceaa6d954599230123d012265b-5f1d891b61135340-00",
"parentId":"|00-508c4cceaa6d954599230123d012265b-5f1d891b61135340-00.2fac18fc_"
},
"location":{
"clientip":"0.0.0.0",
"continent":"North America",
"country":"United States",
"province":"Virginia",
"city":"Washington"
},
"custom":{
"dimensions":[
{
"Timestamp":"XXXX"
},
{
"StatusCode":"200"
},
{
"Activity ID":"HR48uEYXuCE1yIsFMLL3X3-j|0000000"
},
{
"From ID":"XXXX"
},
{
"Correlation ID":"|00-508c4cceaa6d954599230123d012265b-5f1d891b61135340-00.2fac18fc_"
},
{
"Channel ID":"directline"
},
{
"Bot ID":"7222C-RG-CAR-MP5-HVC-Chatbot-P-p7rpums"
},
{
"Activity Type":"message"
},
{
"Conversation ID":"HR48uEYXuCE1yIsFMLL3X3-j"
}
]
}
}
}
I need to extract these records into the following table format (column names listed below):
ActivityId | ActivityType | ChannelId | conversationId | replyActivityId | fromId | locale | recipientId | speak | text | name |eventTime | Date | InstanceId | DialogId | StepName | applicationId | intent | intentScore | entities | question | sentimentLabel | sentimentScore | knowledgeBaseId | answer | articleFound | originalQuestion| question | questionId | score | username | city | province | country | Feedback | Comment | Tag
I took your sample data and created a json file. I read it in with Spark using this code:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.json('/tmp/data.json')
df.show()
and it gave me:
+--------------------+--------------------+--------------------+
| context| event| internal|
+--------------------+--------------------+--------------------+
|{{Thu 10/15/2020 ...|[{1, Zip/Postal C...| {{1.61, XXXX}}|
|{{Thu 10/15/2020 ...| [{1, Activity}]|{{1.61, 992b0fc7-...|
+--------------------+--------------------+--------------------+
The problem with this format was that I was losing metadata, so I changed the approach: load each JSON document as a string column and parse it later. You can do this by using:
df = spark.read.text('/tmp/data.json')
df.show()
which gives:
+--------------------+
| value|
+--------------------+
|{"event": [{"name...|
|{"event": [{"name...|
+--------------------+
From here, we can use a pandas UDF (or normal UDF) to process it. I will use the Fugue library as a way to easily convert Python and Pandas code to a Pandas UDF, but you can just turn the logic into a Pandas UDF later if you don't want to use Fugue.
Your final schema is very long so I think the concept will be clearer if I just use the first 3 columns. In this snippet I will extract:
ActivityId | ActivityType | ChannelId
In order to prototype, I will convert the original DataFrame to Pandas:
pdf = df.toPandas()
And then I will make a function that holds the logic. Some of this code may be repetitive and you might be able to simplify it with functions. I think this should be enough to illustrate the logic. One of the frustrating pieces was that it's tedious to pull some fields. There are Lists of Dicts that are a bit hard to access, but you can still get it to work.
import json
from typing import List, Dict, Any, Iterable
def process(df: List[Dict[str,Any]]) -> Iterable[Dict[str,Any]]:
    for row in df:
        # parse the JSON string of the current row (not always the first row)
        record = json.loads(row["value"])
        # Activity Id
        activity_id = record.get('context', {}).get('custom', {}).get('dimensions', [{}])
        activity_id = [x for x in activity_id if "Activity ID" in x.keys()]
        if len(activity_id) == 1:
            activity_id = activity_id[0]['Activity ID']
        else:
            activity_id = None
        # Activity Type
        activity_type = record.get('context', {}).get('custom', {}).get('dimensions', [{}])
        activity_type = [x for x in activity_type if "Activity Type" in x.keys()]
        if len(activity_type) == 1:
            activity_type = activity_type[0]['Activity Type']
        else:
            activity_type = None
        # Channel Id
        channel_id = record.get('context', {}).get('custom', {}).get('dimensions', [{}])
        channel_id = [x for x in channel_id if "Channel ID" in x.keys()]
        if len(channel_id) == 1:
            channel_id = channel_id[0]['Channel ID']
        else:
            channel_id = None
        yield {"ActivityId": activity_id,
               "ActivityType": activity_type,
               "ChannelId": channel_id,
               }
This function parses the JSON string in each row and then extracts the relevant fields, yielding None for any field that is absent. You might notice that the input and output types are not Pandas DataFrames. This is okay because Fugue can handle the conversion for us. In order to test this function, we can do:
import fugue.api as fa
schema = "ActivityId:str, ActivityType:str, ChannelId:str"
out = fa.transform(pdf, process, schema=schema)
# output is Pandas
out.head()
and this will adapt the process function to run on Pandas DataFrames. Schema is a requirement for Spark, so Fugue requires it as well. This gives us the following result:
ActivityId ActivityType ChannelId
HR48uEYXuCE1yIsFMLL3X3-j|0000006 message directline
HR48uEYXuCE1yIsFMLL3X3-j|0000000 message directline
Now that we know it works on Pandas, we can bring it to Spark with the exact same command. We just need to pass in the Spark DataFrame instead.
out = fa.transform(df, process, schema=schema)
# returns a Spark DataFrame
out.show()
Under the hood, Fugue will convert each partition to a List[Dict[str,Any]] and then apply the process function. In this case, it is just applied on the default partitions of your DataFrame. The output annotation Iterable[Dict[str,Any]] tells Fugue how to bring the result back out to a Spark DataFrame.
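Since the full schema has many more columns, the repeated lookup can be folded into a helper. This is only a sketch: `dimensions_to_dict`, `extract` and the `COLUMNS` mapping are hypothetical names, and unlike the code above it keeps the last value when a key appears more than once instead of returning None.

```python
import json


def dimensions_to_dict(record):
    """Merge the single-key dicts under context.custom.dimensions into one dict."""
    dims = record.get("context", {}).get("custom", {}).get("dimensions", [])
    merged = {}
    for item in dims:
        merged.update(item)
    return merged


# Hypothetical mapping from output column name to the key used in the event.
COLUMNS = {
    "ActivityId": "Activity ID",
    "ActivityType": "Activity Type",
    "ChannelId": "Channel ID",
    "recipientId": "Recipient ID",
}


def extract(raw_json):
    """Return one output row; columns missing from the event become None."""
    dims = dimensions_to_dict(json.loads(raw_json))
    return {column: dims.get(key) for column, key in COLUMNS.items()}


# Trimmed-down version of the second sample event (it has no Recipient ID).
sample = json.dumps({
    "context": {"custom": {"dimensions": [
        {"Activity ID": "HR48uEYXuCE1yIsFMLL3X3-j|0000000"},
        {"Activity Type": "message"},
        {"Channel ID": "directline"},
    ]}}
})
result = extract(sample)
```

Inside `process`, each row could then be produced with `yield extract(row["value"])`, with the schema string extended to match the keys of `COLUMNS`.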
I have a JSON file, like this:
{
"data" : [
{ "values" : [ "ColumnHeader1", "ColumnHeader2", "ColumnHeader3" ]},
{ "values" : [ "Row1Column1", "Row1Column2", "Row1Column3" ]},
{ "values" : [ "Row2Column1", "Row2Column2", "Row2Column3" ]}
]
}
I want to transform it, to be like this:
{
data: [
{ "ColumnHeader1" : "Row1Value1", "ColumnHeader2": "Row1Value2", "ColumnHeader3" : "Row1Value3" },
{ "ColumnHeader1" : "Row2Value1", "ColumnHeader2": "Row2Value2", "ColumnHeader3" : "Row2Value3" }
]
}
I did write a Python script for that - but I wonder could something clever be done via jq or pandas ? (or some other Unix tool or Python library...)
A jq-only solution:
def objectify($header):
. as $in
| reduce range(0; $header|length) as $i ({}; .[$header[$i]] = $in[$i] );
.data[0].values as $header
| .data |= (.[1:] | map(.values | objectify($header)) )
If you like nifty:
def objectify($header): with_entries(.key |= $header[.]) ;
So, if you want a two-liner:
.data[0].values as $header
| .data |= (.[1:] | map(.values | with_entries(.key |= $header[.])))
Here is a different jq solution without reduce:
.data |= (
map(.values)
| first as $headers | del(first)
| map(
[ $headers, .]
| transpose
| map({(first): last})
| add
)
)
Output:
{
"data": [
{
"ColumnHeader1": "Row1Column1",
"ColumnHeader2": "Row1Column2",
"ColumnHeader3": "Row1Column3"
},
{
"ColumnHeader1": "Row2Column1",
"ColumnHeader2": "Row2Column2",
"ColumnHeader3": "Row2Column3"
}
]
}
Or to rebuild the result object from scratch:
{
data: (
.data | map(.values)
| first as $headers | del(first)
| map(
[ $headers, .]
| transpose
| map({(first): last})
| add
)
)
}
first as $headers could be rewritten as . as [$headers] or .[0] as $headers. del(first) could be replaced with .[1:].
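For readers less familiar with jq, the pairing step `[$headers, .] | transpose | map({(first): last}) | add` can be mimicked in Python. This is a rough analogy to show what each stage produces, not a description of how jq evaluates it; the variable names are chosen only for this sketch.

```python
headers = ["ColumnHeader1", "ColumnHeader2", "ColumnHeader3"]
row = ["Row1Column1", "Row1Column2", "Row1Column3"]

# transpose: [[h1, h2, h3], [v1, v2, v3]] -> [[h1, v1], [h2, v2], [h3, v3]]
pairs = [list(pair) for pair in zip(headers, row)]

# map({(first): last}): turn every pair into a one-key object
objects = [{pair[0]: pair[-1]} for pair in pairs]

# add: merge the one-key objects into a single object
merged = {}
for obj in objects:
    merged.update(obj)
```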
(not very elegant) Answer
"""
Read a JSON file where the 1st item in array is a set of headers, and the other items are values.
Outputs a JSON file where the other items in array are transposed to use those headers.
"""
import json
from sys import argv
input_json=argv[1]
output_json=argv[2]
data = None
with open(input_json, "r") as infile:
    # returns JSON object as a dictionary
    data = json.load(infile)
headings = data["data"][0]['values']
new_rows = []
rows = len(data["data"])
for r in range(1, rows):
    row = data["data"][r]['values']
    new_row = dict()
    new_rows.append(new_row)
    for h in range(0, len(headings)):
        new_row[headings[h]] = row[h]
new_data = dict()
new_data["data"] = new_rows
with open(output_json, "w") as outfile:
    json_object = json.dumps(new_data, indent=2)
    outfile.write(json_object)
Hoping there is a better way with less code :)
A one liner:
d = {
"data" : [
{ "values" : [ "ColumnHeader1", "ColumnHeader2", "ColumnHeader3" ]},
{ "values" : [ "Row1Column1", "Row1Column2", "Row1Column3" ]},
{ "values" : [ "Row2Column1", "Row2Column2", "Row2Column3" ]}
]
}
d = {"data": [{k: v for k, v in zip(d["data"][0]["values"], row["values"])} for row in d["data"][1:]]}
Outputs:
{'data': [{'ColumnHeader1': 'Row1Column1', 'ColumnHeader2': 'Row1Column2', 'ColumnHeader3': 'Row1Column3'}, {'ColumnHeader1': 'Row2Column1', 'ColumnHeader2': 'Row2Column2', 'ColumnHeader3': 'Row2Column3'}]}
You don't need to iterate all headings. I hope it helps.
data = {
"data": [
{"values": ["ColumnHeader1", "ColumnHeader2", "ColumnHeader3"]},
{"values": ["Row1Column1", "Row1Column2", "Row1Column3"]},
{"values": ["Row2Column1", "Row2Column2", "Row2Column3"]},
]
}
data = data["data"]
headings = data[0]["values"]
rows = data[1:]
new_data = {"data":[dict(zip(headings, row["values"])) for row in rows]}
Here's a solution for jq using transpose and map:
.data |= (map(.values) | transpose
| map([{(.[0]): .[1:][]}]) | transpose
| map(add)
)
{
"data": [
{
"ColumnHeader1": "Row1Column1",
"ColumnHeader2": "Row1Column2",
"ColumnHeader3": "Row1Column3"
},
{
"ColumnHeader1": "Row2Column1",
"ColumnHeader2": "Row2Column2",
"ColumnHeader3": "Row2Column3"
}
]
}
I'm trying to completely flatten my json response from an API into a pandas dataframe and have no idea how to flatten out a list of objects within the response - This relates to the "Lines" column located in the documentation here and below.
"Lines" : [
{
"Account" : {
"UID" : "17960eb4-3e14-4805-aae2-5b2387da1153",
"Name" : "Trade Debtors",
"DisplayID" : "1-1310",
"URI" : "{cf_uri}/GeneralLedger/Account/17960eb4-3e14-4805-aae2-5b2387da1153"
},
"Amount" : 100,
"IsCredit" : false,
"Job" : null,
"LineDescription" : "",
"ReconciledDate" : null,
"UnitCount": null
},
{
"Account" : {
"UID" : "f7d18c92-ada8-428e-b02a-9223022f84b2",
"Name" : "Late Fees Collected",
"DisplayID" : "4-3000",
"URI" : "{cf_uri}/GeneralLedger/Account/f7d18c92-ada8-428e-b02a-9223022f84b2"
},
"Amount" : 90.91,
"IsCredit" : true,
"Job" : null,
"LineDescription" : "Line 1 testing",
"UnitCount": null
},
{
"Account" : {
"UID" : "5427d47c-499a-4386-ad67-72de39520a00",
"Name" : "GST Collected",
"DisplayID" : "2-1210",
"URI" : "{cf_uri}/GeneralLedger/Account/5427d47c-499a-4386-ad67-72de39520a00"
},
"Amount" : 9.09,
"IsCredit" : true,
"Job" : null,
"LineDescription" : "",
"ReconciledDate" : null,
"UnitCount": null
}
],
My Code:
import pandas as pd
import requests
payload = {}
headers = {
    'x-myobapi-key': client_id,
    'x-myobapi-version': 'v2',
    'Accept-Encoding': 'gzip,deflate',
    'Authorization': f'Bearer {access_token}'
}
response = requests.request("GET", url, headers=headers, data=payload)
result = response.json()
df = pd.json_normalize(result, 'Items')
while result['NextPageLink'] is not None:
    response = requests.request("GET", result['NextPageLink'], headers=headers, data=payload)
    result = response.json()
    df1 = pd.json_normalize(result, 'Items')
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    df = pd.concat([df, df1])
The code above appends each page of results until there is no next-page link. As the following output shows, it expanded the SourceTransaction columns but not the Lines column, apparently because Lines is a list.
To access the lines I need to use result["Items"][0]["Lines"], but that only covers the first element.
+------+-----------+-------------+--------------+------------+-------------+--------------------------------+------------------------+------------+-----------------------+-----------------------------------+-----------------------+
| UID | DisplayID | JournalType | DateOccurred | DatePosted | Description | Lines | URI | RowVersion | SourceTransaction.UID | SourceTransaction.TransactionType | SourceTransaction.URI |
+------+-----------+-------------+--------------+------------+-------------+--------------------------------+------------------------+------------+-----------------------+-----------------------------------+-----------------------+
| a100 | PJ001 | Purchase | 2022-01-01 | 2022-01-01 | Transaction | [{'Account': {'UID': '73971... | https://arl1.api.my... | -139 | e06f592c-23b | Bill | https://arl1.api... |
+------+-----------+-------------+--------------+------------+-------------+--------------------------------+------------------------+------------+-----------------------+-----------------------------------+-----------------------+
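The paging loop can also be separated from the DataFrame building by collecting every page's Items into a plain list first. A minimal sketch, assuming each page is a dict with "Items" and "NextPageLink" keys as in the code above; `collect_items` and the `get_json` callback are hypothetical names, and the fake pages exist only to exercise the loop without network access.

```python
def collect_items(first_url, get_json):
    """Follow NextPageLink until it is None and return all Items in one list."""
    items = []
    url = first_url
    while url is not None:
        page = get_json(url)
        items.extend(page.get("Items", []))
        url = page.get("NextPageLink")
    return items


# Fake two-page response standing in for the real API.
fake_pages = {
    "page1": {"Items": [{"UID": "a100"}], "NextPageLink": "page2"},
    "page2": {"Items": [{"UID": "a101"}], "NextPageLink": None},
}
all_items = collect_items("page1", fake_pages.__getitem__)
```

With the real API, `get_json` could be something like `lambda u: requests.get(u, headers=headers).json()`, and the combined list can be normalized once at the end.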
For others that stumble across the same problem, turns out reading the documentation helped me - Who knew?!
I just needed to make a few tweaks with the json_normalize function. Also the order you type the meta parameters in matters, so you'll need to ensure your order matches the API documentation.
df = pd.json_normalize(
    data=result["Items"],
    record_path='Lines',
    meta=['UID', 'DisplayID', 'JournalType',
          ['SourceTransaction', 'UID'],
          ['SourceTransaction', 'TransactionType'],
          ['SourceTransaction', 'URI'],
          'DateOccurred', 'DatePosted', 'Description', 'URI', 'RowVersion'])
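To make the effect of record_path and meta more concrete, here is a hand-rolled sketch of the same idea in plain Python: one output row per entry in Lines, with the parent fields repeated on each row. It is an illustration of the resulting shape, not pandas's implementation; `flatten`, `explode_lines` and the miniature `items` data are made up for this example.

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts into a single dict with dotted keys."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def explode_lines(items, record_key="Lines"):
    """Return one flat row per entry in item[record_key], carrying parent fields along."""
    rows = []
    for item in items:
        parent = flatten({k: v for k, v in item.items() if k != record_key})
        for line in item.get(record_key) or []:
            rows.append({**flatten(line), **parent})
    return rows


# Made-up miniature of the API response shape.
items = [{
    "UID": "a100",
    "SourceTransaction": {"UID": "e06f", "TransactionType": "Bill"},
    "Lines": [
        {"Account": {"Name": "Trade Debtors"}, "Amount": 100, "IsCredit": False},
        {"Account": {"Name": "GST Collected"}, "Amount": 9.09, "IsCredit": True},
    ],
}]
rows = explode_lines(items)
```

Each journal entry therefore contributes as many rows as it has lines, which is also what the json_normalize call above produces.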
I have written a code like :
import pandas as pd
import numpy as np
import json
from flask import Flask,request,jsonify
app = Flask(__name__)
@app.route('/df', methods=['POST', 'GET'])
def ff():
    df = pd.read_csv(r'dataframe_post.csv')
    row = [5, 'Sanjeev', 'AE']
    df.loc[len(df)] = row
    # print(dfs)
    ls = list(df.to_dict().values())
    return jsonify(ls)
if __name__ == '__main__':
    app.run(debug=True)
The output I get shows all the data column-wise. I want to display the data row-wise instead, i.e. each entry individually, like:
[
{
"id": 1,
"name": "Preeti",
"2": "CSE",
},
{
"id": 2,
"name": "Chinky",
"2": "CE",
},
|
|
|
|
|
|
]
and so on.
To return json in your desired format you can use the built in dataframe method instead of listing and jsonifying:
df.to_json(orient="records")
This will give you a json encoded string as in the example below:
df = pd.DataFrame([[5, 'Sanjeev', 'AE'], [6, 'Sven', 'AA']], columns = ["id", "name", "2"])
Which returns:
id name 2
0 5 Sanjeev AE
1 6 Sven AA
And then as JSON:
df.to_json(orient="records")
'[{"id":5,"name":"Sanjeev","2":"AE"},{"id":6,"name":"Sven","2":"AA"}]'
In your df.to_dict call, use to_dict(orient='records') which will build the json row-wise
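The difference between the two orientations can be seen without pandas. By default, `df.to_dict()` returns a dict keyed by column, each holding an index-to-value mapping; the records orientation regroups the same values by row. A small sketch using the example frame from the answer above, with the conversion written by hand purely for illustration:

```python
# What df.to_dict() returns by default for the two-row example frame:
# column name -> {row index -> value}
columns = {
    "id": {0: 5, 1: 6},
    "name": {0: "Sanjeev", 1: "Sven"},
    "2": {0: "AE", 1: "AA"},
}

# Row labels, taken from the first column's mapping.
index = list(next(iter(columns.values())))

# Regroup by row, which is the shape orient="records" produces.
records = [{col: values[i] for col, values in columns.items()} for i in index]
```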
{
"userName" : "Jhon",
"status" : "success",
"id" : 1234,
"myData" : {
"data1": [1,2,3,4],
"data2": [1,2,3,4],
"data3": [1,2,3,4],
"data4": 25,
"data5" : 12
},
"currentStatus" : true
}
How can this data be converted to tabular form?
userName Status Id data1 data2 data3 data4 data5 currentStatus
Jhon success 1234 1 1 1 25 12 true
Jhon success 1234 2 2 2 25 12 true
Jhon success 1234 3 3 3 25 12 true
Jhon success 1234 4 4 4 25 12 true
The tabular form should follow the pattern above. How can this be done using Python? Can anyone help me out?
To simplify the loop when writing to the CSV file, replace all the dataX items that are just numbers with a list of 4 copies of the number. That way you can index all of them the same way.
import json
import csv
# named json_text so it does not shadow the json module
json_text = '''
{
"userName" : "Jhon",
"status" : "success",
"id" : 1234,
"myData" : {
"data1": [1,2,3,4],
"data2": [1,2,3,4],
"data3": [1,2,3,4],
"data4": 25,
"data5" : 12
},
"currentStatus" : true
}'''
data = json.loads(json_text)
for key, val in data['myData'].items():
    if not isinstance(val, list):
        data['myData'][key] = [val] * 4  # convert scalar to list
with open("output.csv", "w", newline="") as f:
    csvfile = csv.writer(f)
    # write header row
    csvfile.writerow(['userName', 'Status', 'Id'] + list(data['myData'].keys()) + ['currentStatus'])
    prefix = [data['userName'], data['status'], data['id']]
    suffix = [data['currentStatus']]
    for i in range(4):
        row = prefix[:]
        # take the i-th element of every dataX list
        for values in data['myData'].values():
            row.append(values[i])
        row += suffix
        csvfile.writerow(row)
There's probably a simpler way to transpose the dictionary of lists into a 2-dimensional list.
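One candidate for that simpler transposition is `zip(*columns)`, which turns a list of columns into a sequence of rows. The sketch below is one possible variant, not a drop-in replacement: it derives the row count from the longest list instead of hard-coding 4, and writes to an in-memory buffer so it runs without touching the file system. The names `columns`, `header`, `rows` and `buffer` are chosen for this example.

```python
import csv
import io
import json

json_text = '''
{
"userName" : "Jhon", "status" : "success", "id" : 1234,
"myData" : {"data1": [1,2,3,4], "data2": [1,2,3,4], "data3": [1,2,3,4],
            "data4": 25, "data5" : 12},
"currentStatus" : true
}'''
data = json.loads(json_text)
my_data = data["myData"]

# Row count = length of the longest list; scalars are repeated to that length.
n_rows = max(len(v) for v in my_data.values() if isinstance(v, list))
columns = [v if isinstance(v, list) else [v] * n_rows for v in my_data.values()]

header = ["userName", "Status", "Id", *my_data, "currentStatus"]

# zip(*columns) transposes the list of columns into rows.
rows = [
    [data["userName"], data["status"], data["id"], *values, data["currentStatus"]]
    for values in zip(*columns)
]

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(rows)
```

Swapping `buffer` for `open("output.csv", "w", newline="")` would write the same rows to a file.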
Not a python answer, and just for information, using jq command line parser:
jq -r '(["userName","status","id","data1",
"data2","data3","data4","data5","currentStatus"], # header string
range(0;.myData.data1|length) as $i| # $i=table index
[.userName,.status,.id,.myData.data1[$i],
.myData.data2[$i],.myData.data3[$i],
.myData.data4,.myData.data5,.currentStatus]) | # extract values
@tsv # format as tab separated value
' file | column -t # display in column
This assumes that all the other arrays have the same number of elements as the data1 array.