Normalizing Cloudwatch Log JSON in Python


I'm trying to clean AWS Cloudwatch's log data, which is delivered in JSON format when queried via boto3. Each log line is stored as an array of dictionaries. For example, one log line takes the following form:
[
{
"field": "field1",
"value": "abc"
},
{
"field": "field2",
"value": "def"
},
{
"field": "field3",
"value": "ghi"
}
]
If this were in a standard key-value format (e.g., {'field1':'abc'}), I would know exactly what to do with it. I'm just getting stuck on untangling the extra layer of hierarchy introduced by the field/value keys. The ultimate goal is to convert the entire response object into a data frame like the following:
| field1 | field2 | field3 |
|--------|--------|--------|
| abc    | def    | ghi    |
(and so on for the rest of the response object, one row per log line.)
Last bit of info: each array has the same set of fields, and there is no nesting deeper than the example I've provided here. Thank you in advance :)

I was able to do this using nested loops. Not my favorite - I always feel like there has to be a more elegant solution than crawling through every single inch of the object, but this data is simple enough that it's still very fast.
import pandas as pd
logList = []  # Empty list to store one flat dictionary per log line
for line in logs:  # logs = the array of log lines from the response object
    line_dict = {}
    # Flatten each {'field': ..., 'value': ...} dict into a single key-value pair
    for entry in line:
        line_dict[entry['field']] = entry['value']
    logList.append(line_dict)
df = pd.json_normalize(logList)
For anyone else working with CloudWatch logs, the actual log lines (like those I displayed above) are nested in an array called 'results' in the boto3 response object. So you'd need to extract that array first, or point the outer loop to it (i.e., for line in response['results']).
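The same flattening can also be written as a nested comprehension. The following is only a sketch: the `response` dictionary is a hypothetical sample shaped like the structure described above, and `flatten_log_lines` is an illustrative helper name, not part of boto3.

```python
# Hypothetical sample shaped like the 'results' array of the boto3 response
response = {
    "results": [
        [
            {"field": "field1", "value": "abc"},
            {"field": "field2", "value": "def"},
            {"field": "field3", "value": "ghi"},
        ],
        [
            {"field": "field1", "value": "jkl"},
            {"field": "field2", "value": "mno"},
            {"field": "field3", "value": "pqr"},
        ],
    ]
}


def flatten_log_lines(results):
    """Turn each list of {'field', 'value'} dicts into one flat dict."""
    return [{entry["field"]: entry["value"] for entry in line} for line in results]


rows = flatten_log_lines(response["results"])
print(rows[0])  # {'field1': 'abc', 'field2': 'def', 'field3': 'ghi'}
```

The resulting list of flat dictionaries can be passed straight to pd.DataFrame(rows) to get one row per log line.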

Related

How can I best convert an API JSON object to a single row for SQL server?

I have a script setup to pull a JSON from an API and I need to convert objects into different columns for a single row layout for a SQL server. See the example below for the body raw layout of an example object:
"answers": {
"agent_star_rating": {
"question_id": 145,
"question_text": "How satisfied are you with the service you received from {{ employee.first_name }} today?",
"comment": "John was exceptionally friendly and knowledgeable.",
"selected_options": {
"1072": {
"option_id": 1072,
"option_text": "5",
"integer_value": 5
}
}
},
In said example I need the output for all parts of agent_star_rating to be individual columns so all data spits out 1 row for the entire survey on our SQL server. I have tried mapping several keys like so:
agent_star_rating = [list(response['answers']['agent_star_rating']['selected_options'].values())[0]['integer_value']]
agent_question = (response['answers']['agent_star_rating']['question_text'])
agent_comment = (response['answers']['agent_star_rating']['comment'])
response['agent_question'] = agent_question
response['agent_comment'] = agent_comment
response['agent_star_rating'] = agent_star_rating
I get the expected result until some surveys have skipped a field such as ['question_text'], at which point I get a missing key error (KeyError). The same thing happens with other objects, and I can't come up with a way to handle these missing keys. If there is a better way to format the output than the key-by-key approach I've used, I'd love to hear ideas! I'm new to Python/pandas, so pardon any improper terminology!

I would do something like this:
# values that you always capture
row = ['value1', 'value2', ...]
# defaults for attributes that a survey may have skipped
gottem_attrs = {'question_id': '',
                'question_text': '',
                'comment': '',
                'selected_options': ''}
# find and save the values that the response has
for attr in response['answers']['agent_star_rating']:
    gottem_attrs[attr] = response['answers']['agent_star_rating'][attr]
# then you have your final row (dict values must be converted to a list first)
final_row = row + list(gottem_attrs.values())
If the response has a value for an attribute, this code saves it. Otherwise, it saves an empty string for that attribute.
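Another way to tolerate skipped fields is dict.get, which returns a default instead of raising KeyError. This is a minimal sketch, assuming the response layout shown in the question; the sample `response` (with question_text deliberately missing) and the helper name `flatten_rating` are hypothetical.

```python
# Hypothetical survey response in which 'question_text' was skipped
response = {
    "answers": {
        "agent_star_rating": {
            "question_id": 145,
            "comment": "John was exceptionally friendly and knowledgeable.",
            "selected_options": {
                "1072": {"option_id": 1072, "option_text": "5", "integer_value": 5}
            },
        }
    }
}


def flatten_rating(response, name="agent_star_rating"):
    """Pull one answer into flat columns, using None for anything missing."""
    answer = response.get("answers", {}).get(name, {})
    options = list(answer.get("selected_options", {}).values())
    return {
        "agent_question": answer.get("question_text"),
        "agent_comment": answer.get("comment"),
        "agent_star_rating": options[0].get("integer_value") if options else None,
    }


row = flatten_rating(response)
print(row)
```

Every survey then produces the same set of columns, with None wherever a field was skipped.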

How to get value from second level Json keys in Python [duplicate]

I have some JSON data like:
{
"status": "200",
"msg": "",
"data": {
"time": "1515580011",
"video_info": [
{
"announcement": "{\"announcement_id\":\"6\",\"name\":\"INS\\u8d26\\u53f7\",\"icon\":\"http:\\\/\\\/liveme.cms.ksmobile.net\\\/live\\\/announcement\\\/2017-08-18_19:44:54\\\/ins.png\",\"icon_new\":\"http:\\\/\\\/liveme.cms.ksmobile.net\\\/live\\\/announcement\\\/2017-10-20_22:24:38\\\/4.png\",\"videoid\":\"15154610218328614178\",\"content\":\"FOLLOW ME PLEASE\",\"x_coordinate\":\"0.22\",\"y_coordinate\":\"0.23\"}",
"announcement_shop": "",
etc.
How do I grab the content "FOLLOW ME PLEASE"? I tried using
replay_data = raw_replay_data['data']['video_info'][0]
announcement = replay_data['announcement']
But now announcement is a string representing more JSON data. I can't continue indexing: announcement['content'] results in TypeError: string indices must be integers.
How can I get the desired string in the "right" way, i.e. respecting the actual structure of the data?

In a single line -
>>> json.loads(data['data']['video_info'][0]['announcement'])['content']
'FOLLOW ME PLEASE'
To help you understand how to access data (so you don't have to ask again), you'll need to stare at your data.
First, let's lay out your data nicely. You can either use json.dumps(data, indent=4), or you can use an online tool like JSONLint.com.
{
'data': {
'time': '1515580011',
'video_info': [{
'announcement': ( # ***
"""{
"announcement_id": "6",
"name": "INS\\u8d26\\u53f7",
"icon": "http:\\\\/\\\\/liveme.cms.ksmobile.net\\\\/live\\\\/announcement\\\\/2017-08-18_19:44:54\\\\/ins.png",
"icon_new": "http:\\\\/\\\\/liveme.cms.ksmobile.net\\\\/live\\\\/announcement\\\\/2017-10-20_22:24:38\\\\/4.png",
"videoid": "15154610218328614178",
"content": "FOLLOW ME PLEASE",
"x_coordinate": "0.22",
"y_coordinate": "0.23"
}"""),
'announcement_shop': ''
}]
},
'msg': '',
'status': '200'
}
*** Note that the data in the announcement key is actually more json data, which I've laid out on separate lines.
First, find out where your data resides. You're looking for the data in the content key, which is accessed by the announcement key, which is part of a dictionary inside a list of dicts, which can be accessed by the video_info key, which is in turn accessed by data.
So, in summary, "descend" the ladder that is "data" using the following "rungs" -
data, a dictionary
video_info, a list of dicts
announcement, a key in the first dict of the list of dicts, whose value is a JSON string
content residing as part of json data.
First,
i = data['data']
Next,
j = i['video_info']
Next,
k = j[0] # since this is a list
If you only want the first element, this suffices. Otherwise, you'd need to iterate:
for k in j:
    ...
Next,
l = k['announcement']
Now, l is JSON data. Load it -
import json
m = json.loads(l)
Lastly,
content = m['content']
print(content)
FOLLOW ME PLEASE
This should hopefully serve as a guide should you have future queries of this nature.

You have nested JSON data; the string associated with the 'announcement' key is itself another, separate, embedded JSON document.
You'll have to decode that string first:
import json
replay_data = raw_replay_data['data']['video_info'][0]
announcement = json.loads(replay_data['announcement'])
print(announcement['content'])
then handle the resulting dictionary from there.

The content of "announcement" is another JSON string. Decode it and then access its contents as you were doing with the outer objects.
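The two-step decoding can be tried end to end with a small stand-in document. This is only a sketch: `raw_text` is a hypothetical, trimmed-down version of the payload in the question, built with json.dumps so that the inner document is embedded as a string.

```python
import json

# Build a hypothetical payload whose 'announcement' value is itself JSON text
inner = {"announcement_id": "6", "content": "FOLLOW ME PLEASE"}
outer = {
    "status": "200",
    "data": {
        "video_info": [
            {"announcement": json.dumps(inner), "announcement_shop": ""}
        ]
    },
}
raw_text = json.dumps(outer)

# First decode: the outer document
data = json.loads(raw_text)
announcement = data["data"]["video_info"][0]["announcement"]
print(type(announcement).__name__)  # str, so it cannot be indexed by key yet

# Second decode: the embedded document
content = json.loads(announcement)["content"]
print(content)  # FOLLOW ME PLEASE
```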

Parse Json file and save specific values [duplicate]

I have this JSON file where the number of ids sometimes changes (more ids will be added):
{
"maps": [
{
"id": "blabla1",
"iscategorical": "0"
},
{
"id": "blabla2",
"iscategorical": "0"
},
{
"id": "blabla3",
"iscategorical": "0"
},
{
"id": "blabla4",
"iscategorical": "0"
}
]
}
I have this python code that has to print all the values of ids:
import json
data = json.load(open('data.json'))
variable1 = data["maps"][0]["id"]
print(variable1)
variable2 = data["maps"][1]["id"]
print(variable2)
variable3 = data["maps"][2]["id"]
print(variable3)
variable4 = data["maps"][3]["id"]
print(variable4)
I have to use variables because I want to show the values in a dropdown menu. Is it possible to save the values of the ids in a more efficient way? How can I find out the number of ids in this JSON file (4 in the example)?

You can get the number of id (which is the number of elements) by checking the length of data['maps']:
number_of_ids = len(data['maps'])
A clean way to get all the id values is storing them in a list.
You can achieve this in a pythonic way like this:
list_of_ids = [map['id'] for map in data['maps']]
Using this approach you don't even need to store the number of elements in the original json, because you iterate through all of them using a foreach approach, essentially.
If the comprehension syntax is unfamiliar, you can achieve the same thing with a classic foreach loop:
list_of_ids = []
for map in data['maps']:
    list_of_ids.append(map['id'])
Or you can use a classic index-based for loop, which is where you really need the length:
number_of_ids = len(data['maps'])
list_of_ids = []
for i in range(0, number_of_ids):
    list_of_ids.append(data['maps'][i]['id'])
This last version is the classic way, but I suggest using one of the other approaches to take advantage of what Python offers!
Happy coding!

data['maps'] is a simple list, so you can iterate over it as such:
for map in data['maps']:
    print(map['id'])
To store them in a variable, you'll need to output them to a list. Storing them each in a separate variable is not a good idea, because like you said, you don't have a way to know how many there are.
ids = []
for map in data['maps']:
    ids.append(map['id'])

Parsing JSON output efficiently in Python?

The block of code below works, but I doubt it is optimal given my limited understanding of JSON, and I can't figure out a more efficient method.
The steam_game_db is like this:
{
"applist": {
"apps": [
{
"appid": 5,
"name": "Dedicated Server"
},
{
"appid": 7,
"name": "Steam Client"
},
{
"appid": 8,
"name": "winui2"
},
{
"appid": 10,
"name": "Counter-Strike"
}
]
}
}
and my Python code so far is
import requests
i = 0
x = 570
req_name_from_id = requests.get(steam_game_db)
j = req_name_from_id.json()
while j["applist"]["apps"][i]["appid"] != x:
    i += 1
returned_game = j["applist"]["apps"][i]["name"]
print(returned_game)
Instead of looping through the entire app list, is there a smarter way to search for it? Ideally, the elements in the data structure would be numbered the same as their corresponding 'appid',
i.e.
appid 570 in the list is Dota2
However, element 570 in the data structure is appid 5069, Red Faction
Also, what type of data structure is this? Perhaps not knowing that has already limited my ability to search for an answer. (It seems like a dictionary of 'appid' and 'element' for each element?)
EDIT: Changed to a for loop as suggested
# returned_id string for appid from another query
req_name_from_id = requests.get(steam_game_db)
j_2 = req_name_from_id.json()
for app in j_2["applist"]["apps"]:
    if app["appid"] == int(returned_id):
        returned_game = app["name"]
print(returned_game)

The most convenient way to access things by a key (like the app ID here) is to use a dictionary.
You pay a little extra performance cost up-front to fill the dictionary, but after that pulling out values by ID is basically free.
However, it's a trade-off. If you only want to do a single look-up during the life-time of your Python program, then paying that extra performance cost to build the dictionary won't be beneficial, compared to a simple loop like you already did. But if you want to do multiple look-ups, it will be beneficial.
# build dictionary
app_by_id = {}
for app in j["applist"]["apps"]:
    app_by_id[app["appid"]] = app["name"]
# use it (the keys are integers, matching the appid values in the JSON)
print(app_by_id[570])
Also think about caching the JSON file on disk. This will save time during your program's startup.
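As a self-contained illustration of the lookup-table idea, here is a sketch that uses the four sample entries from the question in place of the real download; the variable `j` mirrors the question's name for the parsed JSON, and the "unknown" default is an arbitrary choice.

```python
# Sample data copied from the question, standing in for the parsed API response
j = {
    "applist": {
        "apps": [
            {"appid": 5, "name": "Dedicated Server"},
            {"appid": 7, "name": "Steam Client"},
            {"appid": 8, "name": "winui2"},
            {"appid": 10, "name": "Counter-Strike"},
        ]
    }
}

# Build the lookup table once: integer appid -> name
app_by_id = {app["appid"]: app["name"] for app in j["applist"]["apps"]}

# Each later lookup is a single dictionary access instead of a scan of the list
print(app_by_id[10])                  # Counter-Strike
print(app_by_id.get(570, "unknown"))  # unknown (570 is not in this small sample)
```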

It's better to have the JSON file on disk; you can load it directly into a dictionary and start building up your lookup table. As an example, I've tried to maintain your logic while using the dict for lookups. Don't forget to specify the encoding when opening the file if the JSON has special characters in it.
Setup:
import json
apps = {}
with open('bigJson.json', encoding="utf-8") as handle:
    dictdump = json.load(handle)
for item in dictdump['applist']['apps']:
    apps.setdefault(item['appid'], item['name'])
Usage 1:
That's the way you have used it
for appid in range(0, 570):
    if appid in apps:
        print(appid, apps[appid].encode("utf-8"))
Usage 2: That's how you can query a key. Using get instead of [] prevents a KeyError exception if the appid isn't recorded.
print(apps.get(570, 0))

unable to get to one value from multiple values json python

I'm working on a project and this is driving me nuts. I have searched online and found a few answers that worked for my other JSON-related queries, but this one is a bit of a nightmare: I keep getting a traceback error.
this is my json
ServerReturnJson = {
"personId":"59c16cab-9f28-454e-8c7c-213ac6711dfc",
"persistedFaceIds":["3aaafe27-9013-40ae-8e2a-5803dad90d04"],
"name":"ramsey,",
"userData":null
}
data = responseIdentify.read()
print("The following data return : " + data)
#Parse json data to print just
load = json.loads(data)
print(load[0]['name'])
That's where my problem is: I am unable to get the value from name. When I try a for statement next, I get this error:
Traceback (most recent call last):
File "C:\Python-Windows\random_test\cogT2.py", line 110, in <module>
for b in load[0]['name']:
KeyError: 0
using this for loop
for b in load[0]['name']:
    print b[load]
Any support would be most welcome. I'm sure it's something simple; I just cannot figure it out.

Understanding how to reference nested dicts and lists in JSON is the hardest part. Here are a few things to consider.
Using your original data
ServerReturnJson = {
"personId":"59c16cab-9f28-454e-8c7c-213ac6711dfc",
"persistedFaceIds":["3aaafe27-9013-40ae-8e2a-5803dad90d04"],
"name":"ramsey,",
"userData":'null'
}
# No index here, just the dictionary key
print(ServerReturnJson['name'])
Added second person by making a list of dicts
ServerReturnJson = [{
"personId":"59c16cab-9f28-454e-8c7c-213ac6711dfc",
"persistedFaceIds":["3aaafe27-9013-40ae-8e2a-5803dad90d04"],
"name":"ramsey",
"userData": 'null'
},
{
"personId": "234123412341234234",
"persistedFaceIds": ["1241234123423"],
"name": "miller",
"userData": 'null'
}
]
# You can use the index here since you have a list of dictionaries
print(ServerReturnJson[1]['name'])
# You can iterate like this
for item in ServerReturnJson:
    print(item['name'])

Thanks for your support. Basically, the Microsoft Face API returns JSON with no index, as Chris said in his first example.
The above example works only if you add the following:
data = responseIdentify.read()  # read the incoming response from the server
ServerReturnJson = json.loads(data)
So the complete answer is as follows:
import json
# dataJson stands in for the raw JSON text returned by the server;
# json.loads expects a string, not an already-built dict
dataJson = '''{
"personId":"59c16cab-9f28-454e-8c7c-213ac6711dfc",
"persistedFaceIds":["3aaafe27-9013-40ae-8e2a-5803dad90d04"],
"name":"ramsey,",
"userData":null
}'''
# add json load here
ServerReturnJson = json.loads(dataJson)
# No index here, just the dictionary key
print(ServerReturnJson['name'])
Credit to Chris, thanks. One last thing: Chris mentioned that "Understanding how to reference nested dicts and lists in JSON is the hardest part". 100% agreed.
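If the same code has to cope with both shapes discussed in this thread (a single object, or a list of objects), one option is to check the decoded type before indexing. This is a sketch only: `extract_names` is a hypothetical helper, and the two JSON strings are made-up stand-ins for server responses.

```python
import json


def extract_names(raw_text):
    """Decode JSON text and return every 'name', whether the top level
    is a single object (dict) or an array of objects (list)."""
    parsed = json.loads(raw_text)
    if isinstance(parsed, dict):
        parsed = [parsed]  # treat a lone object as a one-element list
    return [person["name"] for person in parsed]


# Hypothetical responses: one object, then a list of objects
single = '{"personId": "id-1", "name": "ramsey", "userData": null}'
several = '[{"personId": "id-1", "name": "ramsey"}, {"personId": "id-2", "name": "miller"}]'

print(extract_names(single))   # ['ramsey']
print(extract_names(several))  # ['ramsey', 'miller']
```

Indexing with [0] is only valid in the second case, which is why load[0]['name'] raised KeyError: 0 on the single-object response.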
