How to detect and indent json substrings inside longer non-json text? - python

I have an existing Python application, which logs like:
import logging
import json
logger = logging.getLogger()
some_var = 'abc'
data = {
    1: 2,
    'blah': {
        ['hello']
    }
}
logger.info(f"The value of some_var is {some_var} and data is {json.dumps(data)}")
So the logger.info function is given:
The value of some_var is abc and data is {1: 2,"blah": {["hello"]}}
Currently my logs go to AWS CloudWatch, which does some magic and renders this with indentation like:
The value of some_var is abc and data is {
    1: 2,
    "blah": {
        ["hello"]
    }
}
This makes the logs super clear to read.
Now I want to make some changes to my logging, handling it myself with another python script that wraps around my code and emails out logs when there's a failure.
What I want is some way of taking each log entry (or a stream/list of entries), and applying this indentation.
So I want a function which takes in a string, detects which subset(s) of that string are json, then inserts newlines and indentation to pretty-print that json.
example input:
Hello, {"a": {"b": "c"}} is some json data, but also {"c": [1,2,3]} is too
example output
Hello,
{
    "a": {
        "b": "c"
    }
}
is some json data, but also
{
    "c": [
        1,
        2,
        3
    ]
}
is too
I have considered splitting up each entry into everything before and after the first {. Leave the left half as is, and pass the right half to json.dumps(json.loads(x), indent=4).
But what if there's stuff after the json object in the log file?
Ok, we can just select everything after the first { and before the last }.
Then pass the middle bit to the JSON library.
But what if there's two JSON objects in this log entry? (Like in the above example.) We'll have to use a stack to figure out whether any { appears after all prior { have been closed with a corresponding }.
But what if there's something like {"a": "\}"}. Hmm, ok we need to handle escaping.
Now I find myself having to write a whole json parser from scratch.
Is there any easy way to do this?
I suppose I could use a regex to replace every instance of json.dumps(x) in my whole repo with json.dumps(x, indent=4). But json.dumps is sometimes used outside logging statements, and it just makes all my logging lines that extra bit longer. Is there a neat elegant solution?
(Bonus points if it can parse and indent the json-like output that str(x) produces in python. That's basically json with single quotes instead of double.)

In order to extract JSON objects from a string, see this answer. The extract_json_objects() function from that answer will handle JSON objects, and nested JSON objects but nothing else. If you have a list in your log outside of a JSON object, it's not going to be picked up.
In your case, modify the function to also return the strings/text around all the JSON objects, so that you can put them all into the log together (or replace the logline):
from json import JSONDecoder

def extract_json_objects(text, decoder=JSONDecoder()):
    pos = 0
    while True:
        match = text.find('{', pos)
        if match == -1:
            yield text[pos:]  # return the remaining text
            break
        yield text[pos:match]  # modification for the non-JSON parts
        try:
            result, index = decoder.raw_decode(text[match:])
            yield result
            pos = match + index
        except ValueError:
            yield text[match]  # not JSON: keep the literal '{' in the output
            pos = match + 1
Use that function to process your loglines, add them to a list of strings, which you then join together to produce a single string for your output, logger, etc.:
import json

def jsonify_logline(line):
    line_parts = []
    for result in extract_json_objects(line):
        if isinstance(result, dict):  # got a JSON obj
            line_parts.append(json.dumps(result, indent=4))
        else:  # got text/non-JSON-obj
            line_parts.append(result)
    # (don't make that a list comprehension, quite un-readable)
    return ''.join(line_parts)
Example:
>>> demo_text = """Hello, {"a": {"b": "c"}} is some json data, but also {"c": [1,2,3]} is too"""
>>> print(jsonify_logline(demo_text))
Hello, {
    "a": {
        "b": "c"
    }
} is some json data, but also {
    "c": [
        1,
        2,
        3
    ]
} is too
>>>
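For context, the extraction above relies on JSONDecoder.raw_decode, which parses one JSON value from the start of a string and returns it together with the index where parsing stopped. A minimal sketch of that behaviour, using a made-up sample string:

```python
from json import JSONDecoder

decoder = JSONDecoder()
sample = '{"a": {"b": "c"}} trailing text'  # made-up sample for illustration

# raw_decode parses a single JSON value at the start of the string and
# returns (value, end_index); everything after end_index is left untouched.
value, end = decoder.raw_decode(sample)
rest = sample[end:]

# Text that does not start with valid JSON raises ValueError
# (json.JSONDecodeError is a subclass of ValueError).
try:
    decoder.raw_decode('{not json}')
    failed = False
except ValueError:
    failed = True
```

This is why the function can simply try every `{` in turn: a failed attempt costs one exception, and a successful one reports exactly how many characters to skip.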
Other things not directly related which would have helped:
Instead of using json.dumps(x) for all your log lines, follow the DRY principle and create a function like logdump(x) which does whatever you'd want to do, like json.dumps(x), or json.dumps(x, indent=4), or jsonify_logline(x). That way, if you need to change the JSON format for all your logs, you just change that one function; no need for mass "search & replace", which comes with its own issues and edge-cases.
You can even add an optional parameter to it, pretty=True, to decide if you want it indented or not.
You could mass search & replace all your existing loglines to do logger.blah(jsonify_logline(<previous log f-string or text>))
If you are JSON-dumping custom objects/class instances, then use their __str__ method to always output pretty-printed JSON. And the __repr__ to be non-pretty/compact.
Then you wouldn't need to modify the logline at all. Doing logger.info(f'here is my object {x}') would directly invoke obj.__str__.
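A rough sketch of both suggestions, assuming a logdump helper with a pretty flag and a made-up Order class (neither exists in your code; the names are only for illustration):

```python
import json

def logdump(obj, pretty=True):
    """Hypothetical helper: the single place that decides how log data is serialised."""
    return json.dumps(obj, indent=4 if pretty else None)

class Order:
    """Made-up example class: __str__ is pretty-printed, __repr__ is compact."""
    def __init__(self, order_id, items):
        self.order_id = order_id
        self.items = items

    def _as_dict(self):
        return {"order_id": self.order_id, "items": self.items}

    def __str__(self):
        return json.dumps(self._as_dict(), indent=4)

    def __repr__(self):
        return json.dumps(self._as_dict())

order = Order(7, ["apple", "pear"])
pretty_line = f"here is my object {order}"     # f-string formatting uses __str__
compact_line = f"here is my object {order!r}"  # !r forces __repr__
```

With this in place, switching every log line between compact and indented output would be a one-line change inside logdump.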

Related

Why is my For loop in python printing three times?

I am trying to loop through a JSON file to get specific values, however, when doing so the loop is printing three times. I only want the value to print once and have tried breaking the loop but it still has not worked.
Python Code:
with open(filename) as json_filez:
    dataz = json.load(json_filez)
    for i in dataz:
        for i in dataz['killRelated']:
            print(i["SteamID"])
            break
and a snippet of my json file is
{
    "killRelated": [
        {
            "SteamID": "76561198283763531",
            "kill": "15,302",
            "shotacc": "16.1%"
        }
    ],
    "metaData": [
        {
            "test": "lol"
        }
    ],
    "miscData": [
        {
            "damageGiven": "2,262,638",
            "gamePlayed": "1,292",
            "moneyEarned": "50,787,000",
            "score": "31,122",
            "timePlayed": "22d 11h 56m"
        }
    ]
}
and this is my output:
76561198283763531
76561198283763531
76561198283763531
Expected output:
76561198283763531
The return from json.load is a dictionary, and you are only interested in one entry in that, keyed by 'killRelated'. Now the "values" against each dictionary entry are lists, so that is what you need to be iterating through. And each element of such a list is a dictionary that you can again access via a key.
So your code could be:
with open(filename) as json_filez:
    dataz = json.load(json_filez)
    for kr in dataz['killRelated']:  # iterate through the list under the top-level keyword
        print(kr["SteamID"])
Now in your sample data, there's only one entry in the dataz['killRelated'] list, so you'll only get that one printed. But in general, you should expect multiple entries - and cater for the possibility of none. You can handle that by try/except or by checking key existence; here's the latter:
with open(filename) as json_filez:
    dataz = json.load(json_filez)
    if 'killRelated' in dataz:  # check for the top keyword
        for kr in dataz['killRelated']:  # iterate through the list under this keyword
            if 'SteamID' in kr:  # check for the next level keyword
                print(kr["SteamID"])  # report it
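For comparison, the try/except variant might look roughly like this. It is only a sketch: the JSON is inlined as a made-up string so the example runs without a file, and steam_ids is a hypothetical helper name.

```python
import json

# Inline stand-in for the file contents (made-up, trimmed from the question's snippet).
raw = '{"killRelated": [{"SteamID": "76561198283763531"}, {"kill": "3"}], "metaData": []}'
dataz = json.loads(raw)

def steam_ids(data):
    """Collect every SteamID under 'killRelated', skipping entries without one."""
    found = []
    try:
        entries = data['killRelated']
    except KeyError:
        return found  # top-level key missing: nothing to report
    for kr in entries:
        try:
            found.append(kr['SteamID'])
        except KeyError:
            continue  # this entry has no SteamID
    return found

print(steam_ids(dataz))
```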
You were getting three output lines because your outer loop iterated across all keyword entries in dataz (although without examining them), and then each time within that also iterated across the dataz['killRelated'] list. Your addition of break only stopped that inner loop, which for the particular data you had was redundant anyway because it was only going to print one entry.
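A small sketch of that effect, with the data inlined (and shortened) so it runs on its own, and the printed values collected in a list instead:

```python
# Shortened, inlined version of the question's data (three top-level keys).
dataz = {
    "killRelated": [{"SteamID": "76561198283763531"}],
    "metaData": [{"test": "lol"}],
    "miscData": [{}],
}

outer_keys = [key for key in dataz]  # iterating a dict yields its keys

printed = []
for key in dataz:                       # runs once per top-level key: 3 times
    for entry in dataz['killRelated']:  # always the same one-element list
        printed.append(entry["SteamID"])
        break                           # leaves only the inner loop
```

After the loops, printed holds the same ID three times, once per top-level key.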
Your code is correct. You should check your JSON file, or share the full JSON text, since that may be where the problem is. I ran your code with the JSON snippet you provided and it works as expected.
import json
with open("test.json") as json_filez:
    dataz = json.load(json_filez)
    for i in dataz:
        for i in dataz['killRelated']:
            print(i["SteamID"])
and the result is as below:
76561198283763531

How do I extract nested json names and convert to dot notation string list in python?

I need to pull data in from elasticsearch, do some cleaning/munging and export as table/rds.
To do this I have a long list of variable names required to pull from elasticsearch. This list of variables is required for the pull, but the issue is that not all fields may be represented within a given pull, meaning that I need to add the fields after the fact. I can do this using a schema (in nested json format) of the same list of variable names.
To try and [slightly] future proof this work I would ideally like to only maintain the list/schema in one place, and convert from list to schema (or vice-versa).
Is there a way to do this in python? Please see example below of input and desired output.
Small part of schema:
{
    "_source": {
        "filters": {"group": {"filter_value": 0}},
        "user": {
            "email": "",
            "uid": ""
        },
        "status": {
            "date": "",
            "active": True
        }
    }
}
Desired string list output:
[
    "_source.filters.group.filter_value",
    "_source.user.email",
    "_source.user.uid",
    "_source.status.date",
    "_source.status.active"
]
I believe that schema -> list might be an easier transformation than list -> schema, though am happy for it to be the other way round if that is simpler (though need to ensure the schema variables have the correct type, i.e. str, bool, float).
I have explored the following answers which come close, but I am struggling to understand since none appear to be in python:
Convert dot notation to JSON
Convert string with dot notation to JSON
Where d is your json as a dictionary,
def full_search(d):
    arr = []
    def dfs(d, curr):
        if not type(d) == dict or curr[-1] not in d or type(d[curr[-1]]) != dict:
            arr.append(curr)
            return
        for key in d[curr[-1]].keys():
            dfs(d[curr[-1]], curr + [key])
    for key in d.keys():
        dfs(d, [key])
    return ['.'.join(x) for x in arr]
If d is in json form, use
import json
res = full_search(json.loads(d))
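If you ever need the opposite direction (list -> schema), one possible sketch is below. paths_to_schema is a hypothetical name, and giving every leaf the same default value is an assumption, since a bare dotted path carries no type information; you would still have to supply the real types yourself.

```python
def paths_to_schema(paths, default=""):
    """Build a nested dict from dotted paths; every leaf gets `default`
    (an assumption, because the path alone does not say str/bool/float)."""
    schema = {}
    for path in paths:
        *parents, leaf = path.split(".")
        node = schema
        for part in parents:
            node = node.setdefault(part, {})  # descend, creating levels as needed
        node[leaf] = default
    return schema

paths = ["_source.user.email", "_source.user.uid", "_source.status.date"]
schema = paths_to_schema(paths)
```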

Transform string into access function to access a series in a dictionary

I'm trying to pull data from a dictionary ('data') in which several series are provided:
For instance, equity is extracted with:
data['Financials']['Balance_Sheet']['equity']
As I'm having several functions each calling one different series (e.g. equity, debt, goodwill, etc...), I would like to be able to define the "access" for each of those by defining a string such as:
Goodwill -> data['Financials']['Balance_Sheet']['Goodwill']
Equity->data['Financials']['Balance_Sheet']['Equity']
My idea is to do something like that:
Data_pulled= ACCESS('data['Financials']['Balance_Sheet']['Goodwill']')
What is the ACCESS function required to transform the string into an access function?
Hope this is clear! Thanks a lot for your help guys - much appreciated! :)
Max
I question what you're trying to accomplish here. A better answer is probably to write an accessor function that can safely get the field you want without having to type the whole thing out every time.
Consider the following code:
def ACCESS(*strings):
    def ACCESS_LAMBDA(dic):
        result = dic
        for key in strings:
            result = result[key]
        return result
    return ACCESS_LAMBDA

dic = { 'aa': { 'bb': { 'cc': 42 } } }
ABC_ACCESS = ACCESS('aa','bb','cc')
print(ABC_ACCESS(dic))
This is called a closure, where you can define a function at runtime. Here you'd create pull_goodwill = ACCESS('Financials','Balance_Sheet','Goodwill') then get the value with Data_pulled = pull_goodwill(data)
This doesn't exactly answer your question, and the star-arguments and Lambda-returned-function are pretty advanced things. But, don't just "call eval()" that's a pretty insecure coding habit to get into. eval() has its uses... But, think about what you're trying to do, and see if there is a simple abstraction that you can program to access the data you want, rather than relying on the python parser to fetch a value from a dict.
edit:
Link to information about closures in python
edit2:
In order to not have to pass a dictionary to the returned lambda-function, you can pass it into the function constructor. Here's what that would look like, note the change to ACCESS's definition now includes dic and that the ACCESS_LAMBDA definition now takes no arguments.
def ACCESS(dic, *strings):
    def ACCESS_LAMBDA():
        result = dic
        for key in strings:
            result = result[key]
        return result
    return ACCESS_LAMBDA

dic = { 'a': { 'b': { 'c': 42 } } }
ABC_ACCESS = ACCESS(dic, 'a','b','c')
print(ABC_ACCESS())
(Note that if dic is modified, the value returned by ABC_ACCESS will change too. This is because the closure holds a reference to the same dictionary object rather than a copy; if you want a constant value you'd need to make a copy of dic.)
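To illustrate that last point, here is a small sketch comparing an accessor built on the original dictionary with one built on a copy. Using copy.deepcopy is my assumption about how you'd take the snapshot; a shallow copy would still share the nested dictionaries.

```python
import copy

def ACCESS(dic, *strings):
    def ACCESS_LAMBDA():
        result = dic
        for key in strings:
            result = result[key]
        return result
    return ACCESS_LAMBDA

dic = {'a': {'b': {'c': 42}}}
live = ACCESS(dic, 'a', 'b', 'c')                   # shares the caller's dict
frozen = ACCESS(copy.deepcopy(dic), 'a', 'b', 'c')  # works on a private snapshot

dic['a']['b']['c'] = 99  # mutate the original after both accessors were built

print(live(), frozen())
```

Here live() follows the mutation while frozen() keeps returning the value from the time the snapshot was taken.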

Taking a list of functions(with parameters) as input in python

My intention is to do this:
config = {"methods": ["function1(params1)", "function2(params2)"] }
This I read from a json file. So to use it, I need to pass it to another function as follows:
for method in config["methods"]:
    foo(globals()[method])
I know this won't work, as globals() only converts the function name from a string to a function, but I need this to work for functions with parameters as well.
Also I have thought of this:
config = {"methods": [("function1", params1) , ("function2", params2)] }
for method in config["methods"]:
    foo(globals()[method[0]](method[1]))
This will work but I might have some functions for which I wouldn't require parameters. I don't want to have a condition check for whether the second entry in the tuple exists.
Is there any other way around this? I am open to change in the entire approach including the input format. Please do suggest.
Here is a simplified example that works with any number of parameters:
from re import findall

def a(*args):
    for i in args:
        print(i)

config = {"methods": ["a('hello')", "a('hello','bye')", "a()"]}
for i in config['methods']:
    x,y = findall(r'(.*?)\((.*?)\)', i)[0]
    y = [i.strip() for i in y.split(',')]
    globals()[x](*y)
[OUTPUT]
'hello'
'hello'
'bye'
Here's a slight modification of @sshashank124's answer, which is simpler because it accepts the data differently. I think using 'f(arg1, arg2)' is not so intuitive, and gets very repetitive. Instead, I make it a dictionary mapping each function name to a list of lists, where each inner list represents one execution and contains only the arguments, thus:
config = { "methods": {"a": [ ["hello"], ["hello","bye"], [] ]} }
means:
a("hello")
a("hello", "bye")
a()
I'm not sure it's any better than Shank's version, but I think it's easier to understand:
def a(*args):
    for i in args:
        print(i)

config = { "methods": {"a": [ ["hello"], ['hello','bye'], [] ]} }
for f,args in config['methods'].items():
    for arg in args:
        globals()[f](*arg)
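A variation on the same idea, as a sketch only: instead of looking names up in globals(), keep an explicit dictionary of callable functions, so the config can only reach functions you chose to expose. REGISTRY is a hypothetical name, and a() returns its arguments here instead of printing them so the result is easy to inspect.

```python
def a(*args):
    return list(args)  # return instead of print so the result is easy to check

# Explicit registry (hypothetical name) instead of globals(): only the
# functions listed here can be called from the config.
REGISTRY = {"a": a}

config = {"methods": {"a": [["hello"], ["hello", "bye"], []]}}

results = []
for name, calls in config["methods"].items():
    func = REGISTRY[name]  # raises KeyError for names that are not registered
    for args in calls:
        results.append(func(*args))
```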

python 3: splitting information and reporting occurrences

Say, for example, I want to count how many times bob visits sears and walmart. How would I do this by creating dictionaries?
information given:
bob:oct1:sears
bob:oct1:walmart
mary:oct2:walmart
don:oct2:sears
bob:oct4:walmart
mary:oct4:sears
Okay, as this might be homework, I’ll try to give you some hints on how to do this. If this is not homework, please say so, and I’ll restore my original answer and example code.
So first of all, your data set is laid out so that each entry is on a single line. As we want to work with each data entry on its own, we have to split the original data into individual lines. We can use str.split for that.
Each entry is constructed in a simple format name:date:location. So to get each of those segments again, we can use str.split again. Then we end up with separated content for each entry.
To store this, we want to group the data by the name first. So we choose a dictionary taking the name as the key, and put in the visits as the data. As we don’t care about the date, we can forget about it. Instead we want to count how often a single location occurs for a given name. So what we do is keep another dictionary using the locations as the key and the visit count as the data. So we end up with a nested dictionary, looking like this:
{
    'bob': {
        'sears': 1,
        'walmart': 2,
    },
    'mary': {
        ...
    }
}
So to get the final answers we just look into that dictionary and can immediately read out the values.
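The steps above could be sketched roughly as follows, with the data inlined as a string (an assumption, since the question doesn't say where the data comes from) and plain dictionaries throughout:

```python
raw = """bob:oct1:sears
bob:oct1:walmart
mary:oct2:walmart
don:oct2:sears
bob:oct4:walmart
mary:oct4:sears"""

visits = {}
for line in raw.split("\n"):                  # one entry per line
    name, _date, location = line.split(":")   # name:date:location, date is ignored
    per_person = visits.setdefault(name, {})  # inner dict for this name
    per_person[location] = per_person.get(location, 0) + 1

print(visits["bob"]["walmart"])
```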
@poke provided a nice explanation; here's the corresponding code:
Read input from files provided on command-line or stdin and dump occurrences in json format:
#!/usr/bin/env python
import csv
import fileinput
import json
import sys
from collections import defaultdict

visits = defaultdict(lambda: defaultdict(int))
for name, _, shop in csv.reader(fileinput.input(), delimiter=':'):
    visits[name][shop] += 1

# pretty print (sort_keys gives the alphabetical order shown below)
json.dump(visits, sys.stdout, indent=2, sort_keys=True)
Output
{
  "bob": {
    "sears": 1,
    "walmart": 2
  },
  "don": {
    "sears": 1
  },
  "mary": {
    "sears": 1,
    "walmart": 1
  }
}
This representation makes it easy to find out how many visits a person made and where.
If you always know both name and location then you could use a simpler representation:
visits = defaultdict(int)
for name, _, shop in csv.reader(fileinput.input(), delimiter=':'):
    visits[name, shop] += 1
print(visits['bob', 'walmart'])
# -> 2
