Complex date range extraction from JSON using Python

I want to extract all the dates after 2019-10-21 up to today from the JSON response below using Python. I'm very new to Python and just beginning to explore its functions. Can anyone give me a hint on where to start?
My API response:
response = {
    "id": "100",
    "location": {
        "address1": {"city": "x", "state": "y", "zip": "55"},
        "address2": {"city": "g", "state": "h", "zip": "33"},
    },
    "date": [
        {"shipping_date": "2020-12-13", "shipping_name": "xuv"},
        {"shipping_date": "2014-11-31", "shipping_name": "yuv"},
        {"shipping_date": "2020-12-14", "shipping_name": "puv"},
        {"shipping_date": "2020-08-22", "shipping_name": "juv"},
        {"shipping_date": "2019-10-21", "shipping_name": "auv"},
    ],
}
My desired output:
id | shipping_date | shipping_name
100| 2020-12-13 | xuv
100| 2020-12-14 | puv
100| 2020-08-22 | juv

for data in response["date"]:
    print(data["shipping_date"])
Use this code to get all the shipping dates.

Building upon Sagun Devkota's answer, since the OP asked for data within a specific date range.
You can use time.strptime to parse and compare dates.
Suppose you need everything from 2019-01-01 to 2020-12-31; you can do:
import time

start = time.strptime("2019-1-1", "%Y-%m-%d")
end = time.strptime("2020-12-31", "%Y-%m-%d")
for data in response["date"]:
    try:
        date = time.strptime(data["shipping_date"], "%Y-%m-%d")
    except ValueError:  # skip invalid dates such as "2014-11-31"
        continue
    if start <= date <= end:
        print(data["shipping_date"])
PS: November has 30 days, so the shipping_date at index 1 (2014-11-31) is invalid; without the try/except above, time.strptime would raise a ValueError on it.
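Since the question actually asks for everything strictly after 2019-10-21 up to today, here is a minimal sketch of that exact filter using datetime.date (assuming the same response dict as above):
from datetime import date, datetime

start = date(2019, 10, 21)
today = date.today()
print("id | shipping_date | shipping_name")
for data in response["date"]:
    try:
        shipped = datetime.strptime(data["shipping_date"], "%Y-%m-%d").date()
    except ValueError:  # skip invalid dates such as "2014-11-31"
        continue
    if start < shipped <= today:
        print(f'{response["id"]}| {data["shipping_date"]} | {data["shipping_name"]}')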

Related

How to parse nested JSON (blob format) using PySpark

I'm getting records in blob format, separated by newlines. Below is an example of two events separated by a newline.
A few things to note here:
In the example below, the event structure is inconsistent. For certain events I will get Channel Id, conversation Id, replyActivity Id, from Id, and locale columns; for the absent columns I need to populate null in my data frame.
How will I be able to achieve this in PySpark?
Example:
{
"event":[
{
"name":"Zip/Postal Code",
"count":1
}
],
"internal":{
"data":{
"id":"XXXX",
"documentVersion":"1.61"
}
},
"context":{
"application":{
"version":"Thu 10/15/2020 2:46:54.65 \r\nUTC (fv-az464-530) [Build 174613] [Repo Intercom] [Branch prod] [Commit XXXX] \r\n[IntercomWebUIVersion 1.6.20-169031] [IntercomBotAppTemplatesVersion 1.3.27-165664] \r\n"
},
"data":{
"eventTime":"2020-10-20T15:54:48.7734934Z",
"isSynthetic":false,
"samplingRate":100.0
},
"cloud":{
},
"device":{
"type":"PC",
"roleName":"bc-directline-eus2",
"roleInstance":"RD0004FFA145F5",
"screenResolution":{
}
},
"session":{
"isFirst":false
},
"operation":{
"id":"f115c4bf-4fa31385d9a8f248",
"parentId":"|f115c4bf-4fa31385d9a8f248."
},
"location":{
"clientip":"0.0.0.0",
"continent":"North America",
"country":"United States",
"province":"Virginia",
"city":"Boydton"
},
"custom":{
"dimensions":[
{
"Timestamp":"XXXX"
},
{
"StatusCode":"200"
},
{
"Activity ID":"HR48uEYXuCE1yIsFMLL3X3-j|0000006"
},
{
"From ID":"XXXX"
},
{
"Correlation ID":"|f115c4bf-4fa31385d9a8f248."
},
{
"Channel ID":"directline"
},
{
"Recipient ID":"7222C-RG-CAR-MP5-HVC-Chatbot-P-p7rpums#Ye6TP1LJz0o"
},
{
"Bot ID":"XXXX"
},
{
"Activity Type":"message"
},
{
"Conversation ID":"HR48uEYXuCE1yIsFMLL3X3-j"
}
]
}
}
}{
"event":[
{
"name":"Activity",
"count":1
}
],
"internal":{
"data":{
"id":"992b0fc7-12ec-11eb-b59a-fb2df7d234d8",
"documentVersion":"1.61"
}
},
"context":{
"application":{
"version":"Thu 10/15/2020 2:46:54.65 \r\nUTC (fv-az464-530) [Build 174613] [Repo Intercom] [Branch prod] [Commit XXXX] \r\n[IntercomWebUIVersion 1.6.20-169031] [IntercomBotAppTemplatesVersion 1.3.27-165664] \r\n"
},
"data":{
"eventTime":"2020-10-20T15:54:34.3811795Z",
"isSynthetic":false,
"samplingRate":100.0
},
"cloud":{
},
"device":{
"type":"PC",
"roleName":"bc-directline-eastus3",
"roleInstance":"RD00155D33F838",
"screenResolution":{
}
},
"session":{
"isFirst":false
},
"operation":{
"id":"00-508c4cceaa6d954599230123d012265b-5f1d891b61135340-00",
"parentId":"|00-508c4cceaa6d954599230123d012265b-5f1d891b61135340-00.2fac18fc_"
},
"location":{
"clientip":"0.0.0.0",
"continent":"North America",
"country":"United States",
"province":"Virginia",
"city":"Washington"
},
"custom":{
"dimensions":[
{
"Timestamp":"XXXX"
},
{
"StatusCode":"200"
},
{
"Activity ID":"HR48uEYXuCE1yIsFMLL3X3-j|0000000"
},
{
"From ID":"XXXX"
},
{
"Correlation ID":"|00-508c4cceaa6d954599230123d012265b-5f1d891b61135340-00.2fac18fc_"
},
{
"Channel ID":"directline"
},
{
"Bot ID":"7222C-RG-CAR-MP5-HVC-Chatbot-P-p7rpums"
},
{
"Activity Type":"message"
},
{
"Conversation ID":"HR48uEYXuCE1yIsFMLL3X3-j"
}
]
}
}
}
I need to extract these records in to following table format (Column Name mentioned below),
ActivityId | ActivityType | ChannelId | conversationId | replyActivityId | fromId | locale | recipientId | speak | text | name |eventTime | Date | InstanceId | DialogId | StepName | applicationId | intent | intentScore | entities | question | sentimentLabel | sentimentScore | knowledgeBaseId | answer | articleFound | originalQuestion| question | questionId | score | username | city | province | country | Feedback | Comment | Tag
I took your sample data and created a json file. I read it in with Spark using this code:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.json('/tmp/data.json')
df.show()
and it gave me:
+--------------------+--------------------+--------------------+
| context| event| internal|
+--------------------+--------------------+--------------------+
|{{Thu 10/15/2020 ...|[{1, Zip/Postal C...| {{1.61, XXXX}}|
|{{Thu 10/15/2020 ...| [{1, Activity}]|{{1.61, 992b0fc7-...|
+--------------------+--------------------+--------------------+
The problem with this format was that I was losing metadata, so I changed the approach: load the JSON as a string column and then parse it later. You can do this by using:
df = spark.read.text('/tmp/data.json')
df.show()
which gives:
+--------------------+
| value|
+--------------------+
|{"event": [{"name...|
|{"event": [{"name...|
+--------------------+
From here, we can use a pandas UDF (or normal UDF) to process it. I will use the Fugue library as a way to easily convert Python and Pandas code to a Pandas UDF, but you can just turn the logic into a Pandas UDF later if you don't want to use Fugue.
Your final schema is very long so I think the concept will be clearer if I just use the first 3 columns. In this snippet I will extract:
ActivityId | ActivityType | ChannelId
In order to prototype, I will convert the original DataFrame to Pandas:
pdf = df.toPandas()
And then I will make a function that holds the logic. Some of this code may be repetitive and you might be able to simplify it with functions. I think this should be enough to illustrate the logic. One of the frustrating pieces was that it's tedious to pull some fields. There are Lists of Dicts that are a bit hard to access, but you can still get it to work.
import json
from typing import List, Dict, Any, Iterable

def process(df: List[Dict[str, Any]]) -> Iterable[Dict[str, Any]]:
    for row in df:
        record = json.loads(row["value"])
        dimensions = record.get('context', {}).get('custom', {}).get('dimensions', [{}])
        # Activity Id
        activity_id = [x for x in dimensions if "Activity ID" in x.keys()]
        activity_id = activity_id[0]['Activity ID'] if len(activity_id) == 1 else None
        # Activity Type
        activity_type = [x for x in dimensions if "Activity Type" in x.keys()]
        activity_type = activity_type[0]['Activity Type'] if len(activity_type) == 1 else None
        # Channel Id
        channel_id = [x for x in dimensions if "Channel ID" in x.keys()]
        channel_id = channel_id[0]['Channel ID'] if len(channel_id) == 1 else None
        yield {"ActivityId": activity_id,
               "ActivityType": activity_type,
               "ChannelId": channel_id}
This function parses each row's JSON string and then extracts the relevant fields. You might notice that the input and output types are not Pandas DataFrames. This is okay because Fugue can handle the conversion for us. In order to test this function, we can do:
import fugue.api as fa
schema = "ActivityId:str, ActivityType:str, ChannelId:str"
out = fa.transform(pdf, process, schema=schema)
# output is Pandas
out.head()
and this will adapt the process function to run on Pandas DataFrames. Schema is a requirement for Spark, so Fugue requires it as well. This gives us the following result:
ActivityId                        ActivityType  ChannelId
HR48uEYXuCE1yIsFMLL3X3-j|0000006  message       directline
HR48uEYXuCE1yIsFMLL3X3-j|0000000  message       directline
Now that we know it works on Pandas, we can bring it to Spark with the exact same command. We just need to pass in the Spark DataFrame instead.
out = fa.transform(df, process, schema=schema)
# returns a Spark DataFrame
out.show()
Under the hood, Fugue will convert each partition to a List[Dict[str, Any]] and then apply the process function. In this case, it is just applied on the default partitions of your DataFrame. The output annotation Iterable[Dict[str, Any]] guides Fugue in bringing the result back out to a Spark DataFrame.
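If you would rather stay with native PySpark, here is a hedged sketch of the same logic as a pandas UDF via mapInPandas (assuming Spark 3.0+ and reusing the process function above):
import pandas as pd

def process_partition(batches):
    # each batch is a pandas DataFrame with the single "value" column
    for pdf in batches:
        yield pd.DataFrame(list(process(pdf.to_dict("records"))),
                           columns=["ActivityId", "ActivityType", "ChannelId"])

out = df.mapInPandas(process_partition,
                     schema="ActivityId string, ActivityType string, ChannelId string")
out.show()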

JSON jq/python file manipulation with specific key name aggregation

I need to modify the structure of this json file:
[
{
"id":"3333",
"properties":{
"label":"Computer",
"name":"My-Laptop"
}
},
{
"id":"9998",
"type":"file_system",
"properties":{
"mount_point":"/opt",
"name":"/dev/mapper/rhel-opt",
"root_container":"3333"
},
"label":"FileSystem"
},
{
"id":"9999",
"type":"file_system",
"properties":{
"mount_point":"/var",
"name":"/dev/mapper/rhel-var",
"root_container":"3333"
},
"label":"FileSystem"
}
]
in order to have this kind of output:
[
{
"id":"3333",
"properties":{
"label":"Computer",
"name":"My-Laptop",
"file_system":[
"/opt",
"/var"
]
}
}
]
The idea is to have, in the new JSON structure, my laptop with its two file-system partitions in an array named "file_system".
As you can see, the two partitions are related to the first entry by their root_container matching its id.
Now imagine having not just one laptop but thousands of laptops, each with a different id, and each with different partitions related to it by the root_container key.
Is there a way to do this with jq functions or a Python script?
Many thanks
You could employ reduce to iterate over the items while extracting their id, mount_point and root_container. Then, if a root_container is present, delete that entry and add its mount_point to the entry whose id matches its root_container. For convenience, I also employed INDEX on the items' id fields to simplify their access as .[$id] and .[$root_container], which had to be undone at the end using map(.).
jq '
  reduce .[] as {$id, properties: {$mount_point, $root_container}} (
    INDEX(.id);
    if $root_container then
      del(.[$id])
      | .[$root_container].properties.file_system += [$mount_point]
    else . end
  )
  | map(.)
'
[
{
"id": "3333",
"properties": {
"label": "Computer",
"name": "My-Laptop",
"file_system": [
"/opt",
"/var"
]
}
}
]
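If you would rather do it in Python, here is a minimal sketch of the same fold (assuming the input shown above lives in input.json; the filename is an assumption):
import json

with open("input.json") as f:
    items = json.load(f)

by_id = {item["id"]: item for item in items}
for item in list(by_id.values()):
    props = item.get("properties", {})
    root = props.get("root_container")
    if root is not None:
        # move this partition's mount_point under its root container
        by_id[root]["properties"].setdefault("file_system", []).append(props["mount_point"])
        del by_id[item["id"]]

print(json.dumps(list(by_id.values()), indent=2))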

How to transpose JSON structs and arrays in PySpark

I have the following JSON file that I'm reading into a dataframe.
{
"details": {
"box": [
{
"Touchdowns": "123",
"field": "Texans"
},
{
"Touchdowns": "456",
"field": "Ravens"
}
]
},
"name": "Team"
}
How could I manipulate this to get the following output?
Team   | Touchdowns
Texans | 123
Ravens | 456
I'm struggling a bit with whether I need to pivot/transpose the data or if there is a more elegant approach.
Read the multiline JSON into Spark:
df = spark.read.json('/path/to/scores.json',multiLine=True)
Schema:
df: pyspark.sql.dataframe.DataFrame
  details: struct
    box: array
      element: struct
        Touchdowns: string
        field: string
  name: string
All of the info you want is in the first row, so get that and drill down to details and box and make that your new dataframe.
spark.createDataFrame(df.first()['details']['box']).withColumnRenamed('field','Team').show()
Output
+----------+------+
|Touchdowns| Team|
+----------+------+
| 123|Texans|
| 456|Ravens|
+----------+------+
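If the file ever contains more than one record, a hedged alternative is to explode the array instead of taking only the first row (a sketch, assuming the same df):
from pyspark.sql import functions as F

out = (df
       .select(F.explode("details.box").alias("box"))
       .select(F.col("box.field").alias("Team"),
               F.col("box.Touchdowns").alias("Touchdowns")))
out.show()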
You can use the inline function, which explodes an array of structs into one row per element, with one column per struct field.
df = spark.read.load(json_file_path, format='json', multiLine=True)
df = df.selectExpr('inline(details.box)').withColumnRenamed('field', 'Team')
df.show(truncate=False)
You can try using an RDD to get the values of the box list.
Input JSON
jsonstr="""{
"details": {
"box": [
{
"Touchdowns": "123",
"field": "Texans"
},
{
"Touchdowns": "456",
"field": "Ravens"
}
]
},
"name": "Team"
}"""
Now convert it to an RDD using the keys of the dictionary, as below:
import json
box_rdd = sc.parallelize(json.loads(jsonstr)['details']['box'])
box_rdd.collect()
Output - [{'Touchdowns': '123', 'field': 'Texans'},
{'Touchdowns': '456', 'field': 'Ravens'}]
Finally, create the dataframe with this box_rdd, as below:
from pyspark.sql.types import *
schema = StructType([StructField('Touchdowns', StringType(), True), StructField('field', StringType(), True)])
df = spark.createDataFrame(data=box_rdd,schema=schema)
df.show()
+----------+------+
|Touchdowns| field|
+----------+------+
| 123|Texans|
| 456|Ravens|
+----------+------+

Control identifier output in tree format needs to be converted into JSON format

When we print the control identifiers, the output is in a tree format.
Is there a way to print the control identifier output in JSON format?
Current output:
Dialog - 'SP- HS' (L-32000, T-32000, R-31840, B-31972)
['Dialog', 'SP- HSDialog', 'SP- HS']
child_window(title="SP- HS", class_name="#32770")
|
| Static - 'SP' (L-31885, T-31977, R-31641, B-31948)
| ['Static', 'SP', 'SPStatic', 'Static0', 'Static1']
| child_window(title="SP", class_name="Static")
|
| Static - 'HS' (L-31526, T-31977, R-30807, B-31946)
| ['HSStatic', 'HS', 'Static2', 'HSStatic0', 'HSStatic1']
| child_window(title="HS", class_name="Static")
Desired output:
[{
"Name": "SP-HS",
"co-ordinates": "L-32000, T-32000, R-31840, B-31972",
"Alias": "['Dialog', 'SP- HSDialog', 'SP- HS']",
"Title": "SP-HP",
"Class_name": "#32770"
}, {
"Name": "SP",
"co-ordinates": "L-31885, T-31977, R-31641, B-31948",
"Alias": " ['Static', 'SP', 'SPStatic', 'Static0', 'Static1']",
"Title": "SP",
"Class_name": "Static"
}]
I know the hard way, where we use string manipulation operations to convert this to JSON. But if there is an easier way, let me know.
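Since no built-in JSON dump seems to exist for this tree, here is a minimal sketch of the string-manipulation route (assuming the dump above is captured as a string named tree_text; the tree looks like pywinauto's print_control_identifiers() output, but that is an assumption):
import json
import re

def parse_identifiers(dump):
    controls = []
    block = {}
    for raw in dump.splitlines():
        line = raw.strip().lstrip("| ").strip()
        # e.g. Dialog - 'SP- HS' (L-32000, T-32000, R-31840, B-31972)
        m = re.match(r"(\w+) - '(.*)' \((L[-\d]+, T[-\d]+, R[-\d]+, B[-\d]+)\)", line)
        if m:
            block = {"Name": m.group(2), "co-ordinates": m.group(3)}
            controls.append(block)
            continue
        if line.startswith("["):  # alias list, e.g. ['Dialog', 'SP- HSDialog', ...]
            block["Alias"] = line
            continue
        m = re.match(r'child_window\(title="(.*)", class_name="(.*)"\)', line)
        if m:
            block["Title"], block["Class_name"] = m.group(1), m.group(2)
    return controls

print(json.dumps(parse_identifiers(tree_text), indent=2))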

Using pandas and json_normalize to flatten nested JSON API response

I have a deeply nested JSON that I am trying to turn into a Pandas Dataframe using json_normalize.
A generic sample of the JSON data I'm working with looks like this (I've added context of what I'm trying to do at the bottom of the post):
{
"per_page": 2,
"total": 1,
"data": [{
"total_time": 0,
"collection_mode": "default",
"href": "https://api.surveymonkey.com/v3/responses/5007154325",
"custom_variables": {
"custvar_1": "one",
"custvar_2": "two"
},
"custom_value": "custom identifier for the response",
"edit_url": "https://www.surveymonkey.com/r/",
"analyze_url": "https://www.surveymonkey.com/analyze/browse/",
"ip_address": "",
"pages": [
{
"id": "103332310",
"questions": [{
"answers": [{
"choice_id": "3057839051"
}
],
"id": "319352786"
}
]
},
{
"id": "44783164",
"questions": [{
"id": "153745381",
"answers": [{
"text": "some_name"
}
]
}
]
},
{
"id": "44783183",
"questions": [{
"id": "153745436",
"answers": [{
"col_id": "1087201352",
"choice_id": "1087201369",
"row_id": "1087201362"
}, {
"col_id": "1087201353",
"choice_id": "1087201373",
"row_id": "1087201362"
}
]
}
]
}
],
"date_modified": "1970-01-17T19:07:34+00:00",
"response_status": "completed",
"id": "5007154325",
"collector_id": "50253586",
"recipient_id": "0",
"date_created": "1970-01-17T19:07:34+00:00",
"survey_id": "105723396"
}
],
"page": 1,
"links": {
"self": "https://api.surveymonkey.com/v3/surveys/123456/responses/bulk?page=1&per_page=2"
}
}
I'd like to end up with a dataframe that contains the question_id, page_id, response_id, and response data like this:
choice_id col_id row_id text question_id page_id response_id
0 3057839051 NaN NaN NaN 319352786 103332310 5007154325
1 NaN NaN NaN some_name 153745381 44783164 5007154325
2 1087201369 1087201352 1087201362 NaN 153745436 44783183 5007154325
3 1087201373 1087201353 1087201362 NaN 153745436 44783183 5007154325
I can get close by running the following code (Python 3.6):
df = json_normalize(data=so_survey_responses['data'], record_path=['pages', 'questions'], meta='id', record_prefix ='question_')
print(df)
Which returns:
question_answers question_id id
0 [{'choice_id': '3057839051'}] 319352786 5007154325
1 [{'text': 'some_name'}] 153745381 5007154325
2 [{'col_id': '1087201352', 'choice_id': '108720... 153745436 5007154325
But if I try to run json_normalize at a deeper nest and keep the 'question_id' data from the above result, I can only get the page_id values to return, not true question_id values:
answers_df = json_normalize(data=so_survey_responses['data'], record_path=['pages', 'questions', 'answers'], meta=['id', ['questions', 'id'], ['pages', 'id']])
print(answers_df)
Returns:
choice_id col_id row_id text id questions.id pages.id
0 3057839051 NaN NaN NaN 5007154325 103332310 103332310
1 NaN NaN NaN some_name 5007154325 44783164 44783164
2 1087201369 1087201352 1087201362 NaN 5007154325 44783183 44783183
3 1087201373 1087201353 1087201362 NaN 5007154325 44783183 44783183
A complicating factor may be that all the above (question_id, page_id, response_id) are 'id:' in the JSON data.
I'm sure this is possible, but I can't get there. Any examples of how to do this?
Additional context:
I'm trying to create a dataframe of SurveyMonkey API response output.
My long term goal is to re-create the "all responses" excel sheet that their export service provides.
I plan to do this by getting the response dataframe set up (above), and then use .apply() to match responses with their survey structure API output.
I've found the SurveyMonkey API pretty lackluster at providing useful output, but I'm new to Pandas so it's probably on me.
You need to modify the meta parameter of your last option, and, if you want to rename columns to be exactly the way you want, you could do it with rename:
answers_df = json_normalize(data=so_survey_responses['data'],
                            record_path=['pages', 'questions', 'answers'],
                            meta=['id', ['pages', 'questions', 'id'], ['pages', 'id']])\
    .rename(index=str,
            columns={'id': 'response_id',
                     'pages.questions.id': 'question_id',
                     'pages.id': 'page_id'})
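Run against the sample response above, this should reproduce the dataframe shown in the question: the three colliding id fields are disambiguated by their full meta paths (id, pages.questions.id, pages.id) before being renamed.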
There is no way to do this in a completely generic way using json_normalize(). You can use the record_path and meta arguments to indicate how you want the JSON to be processed.
However, you can use the flatten package to flatten your deeply nested JSON and then convert that to a Pandas dataframe. Its package page has example usage of how to flatten a deeply nested JSON and convert it to a Pandas dataframe.
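A minimal sketch of that approach, assuming the flatten_json package (pip install flatten_json) and the so_survey_responses dict from the question:
import pandas as pd
from flatten_json import flatten

# one flat dict per response; nested keys are joined with "_" by default
rows = [flatten(record) for record in so_survey_responses["data"]]
df = pd.DataFrame(rows)
print(df.filter(like="pages_").columns)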
