Parsing JSON with python: blank fields - python

I'm having problems while parsing a JSON with python, and now I'm stuck.
The problem is that the entries of my JSON do not always have the same fields. The JSON is something like:
"entries": [
    {
        "summary": "here is the summary",
        "extensions": {
            "coordinates": "coords",
            "address": "address",
            "name": "name",
            "telephone": "123123",
            "url": "www.blablablah"
        }
    }
]
I can move through the JSON, for example:
for entrie in entries:
    name = entrie['extensions']['name']
    tel = entrie['extensions']['telephone']
The problem is that the JSON does not always have all the fields. For example, the telephone field is sometimes missing, so the script fails with a KeyError because the key telephone is not in that entry.
So, my question: how can I run this script, leaving a blank where telephone is missing?
I've tried:
if entrie['extensions']['telephone']:
    tel = entrie['extensions']['telephone']
but I don't think that is right.

Use dict.get instead of []:
entrie['extensions'].get('telephone', '')
Or, simply:
entrie['extensions'].get('telephone')
get returns its second argument (which defaults to None) instead of raising a KeyError when the key is not found.
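To make the difference between the two calls concrete, here is a minimal sketch. The sample entries are made up for illustration; only dict.get itself comes from the answer above.

```python
# Hypothetical entry whose "extensions" has no "telephone" key.
entry = {"extensions": {"name": "name", "url": "www.blablablah"}}

ext = entry["extensions"]
tel_blank = ext.get("telephone", "")  # explicit default: empty string
tel_none = ext.get("telephone")       # no default given: None
name = ext.get("name", "")            # key present: the stored value is returned

# If "extensions" itself may be absent, chain the lookups with an empty
# dict as the intermediate default.
entry_without_ext = {"summary": "no extensions here"}
tel_safe = entry_without_ext.get("extensions", {}).get("telephone", "")

print(repr(tel_blank), repr(tel_none), repr(name), repr(tel_safe))
```

Passing '' as the default is what gives the "blank space" asked for in the question; leaving it out gives None, which you would have to handle separately when printing.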

If the data is missing in only one place, then dict.get can be used to fill in the missing value:
tel = d['entries'][0]['extensions'].get('telephone', '')
If the problem is more widespread, you can have the JSON parser use a defaultdict or custom dictionary instead of a regular dictionary. For example, given the JSON string:
json_txt = '''{
"entries": [
{
"extensions": {
"telephone": "123123",
"url": "www.blablablah",
"name": "name",
"coordinates": "coords",
"address": "address"
},
"summary": "here is the summary"
}
]
}'''
Parse it with:
>>> import json
>>> class BlankDict(dict):
...     def __missing__(self, key):
...         return ''
...
>>> d = json.loads(json_txt, object_hook=BlankDict)
>>> d['entries'][0]['summary']
'here is the summary'
>>> d['entries'][0]['extensions']['color']
''
As a side note, if you want to clean up your datasets and enforce consistency, there is a fine tool called Kwalify that does schema validation on JSON (and on YAML).

There are several useful dictionary features that you can use to work with this.
First off, you can use in to test whether or not a key exists in a dictionary:
if 'telephone' in entrie['extensions']:
    tel = entrie['extensions']['telephone']
get might also be useful; it allows you to specify a default value if the key is missing:
tel=entrie['extensions'].get('telephone', '')
Beyond that, you could look into the standard library's collections.defaultdict, but that might be overkill.
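For completeness, here is a rough sketch of what the defaultdict route could look like. The sample JSON string and the helper name blank_defaultdict are made up for illustration; object_hook is the standard json.loads parameter that receives every decoded JSON object.

```python
import json
from collections import defaultdict

# Hypothetical input: the entry has no "telephone" field.
json_txt = '{"entries": [{"summary": "s", "extensions": {"name": "name"}}]}'

def blank_defaultdict(obj):
    # object_hook is called with each decoded JSON object as a plain dict;
    # wrapping it in defaultdict(str) makes missing keys evaluate to ''.
    return defaultdict(str, obj)

d = json.loads(json_txt, object_hook=blank_defaultdict)
ext = d["entries"][0]["extensions"]

tel = ext["telephone"]   # missing key: str() is called, giving ''
name = ext["name"]       # existing key: unchanged
print(repr(tel), repr(name))
```

One thing to be aware of: a defaultdict stores the default when a missing key is read, so after the lookup above 'telephone' in ext is True. If that side effect matters, get is the safer choice.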

Two ways.
One is to make sure that your dictionaries are standard, and when you read them in they have all fields. The other is to be careful when accessing the dictionaries.
Here is an example of making sure your dictionaries are standard:
import json

__reference_extensions = {
    # fill in with all standard keys
    # use some default value to go with each key
    "coordinates": '',
    "address": '',
    "name": '',
    "telephone": '',
    "url": ''
}
entrie = json.loads(input_string)
d = entrie["extensions"]
# .items() is needed to iterate over (key, value) pairs
for key, value in __reference_extensions.items():
    if key not in d:
        d[key] = value
Here is an example of being careful when accessing the dictionaries:
for entrie in entries:
    name = entrie['extensions'].get('name', '')
    tel = entrie['extensions'].get('telephone', '')
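The "standard dictionary" idea can also be written as a single merge. This is only a sketch: the names REFERENCE_EXTENSIONS and input_string and the sample data are assumptions for illustration.

```python
import json

# Defaults for every key we expect under "extensions" (assumed key set).
REFERENCE_EXTENSIONS = {
    "coordinates": "",
    "address": "",
    "name": "",
    "telephone": "",
    "url": "",
}

# Hypothetical input with "telephone", "address" and "coordinates" missing.
input_string = '{"extensions": {"name": "name", "url": "www.blablablah"}}'
entrie = json.loads(input_string)

# In a {**a, **b} merge the later mapping wins, so values present in the
# parsed data override the blank defaults.
entrie["extensions"] = {**REFERENCE_EXTENSIONS, **entrie["extensions"]}

print(entrie["extensions"])
```

After the merge every entry has the same keys, so plain [] access no longer raises KeyError for the standard fields.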

Related

How to print out a value in a json, with only 1 'searchstring'

payload = {
"data": {
"name": "John",
"surname": "Doe"
}
}
print(payload["data"]["name"])
I want to print out the value of 'name' inside the json. I know the way to do it like above. But is there also a way to print out the value of 'name' with only 1 'search string'?
I'm looking for something like this
print(payload["data:name"])
Output:
John
If you were dealing with nested attributes of an object I would suggest operator.attrgetter; however, the itemgetter in the same module does not seem to support nested key access. It is fairly easy to implement something similar, though:
payload = {
"data": {
"name": "John",
"surname": "Doe",
"address": {
"postcode": "667"
}
}
}
def get_key_path(d, path):
    # Remember latest object
    obj = d
    # For each key in the given list of keys
    for key in path:
        # Look up that key in the last object
        if key not in obj:
            raise KeyError(f"Object {obj} has no key {key}")
        # now we know the key exists, replace
        # last object with obj[key] to move to
        # the next level
        obj = obj[key]
    return obj
print(get_key_path(payload, ["data"]))
print(get_key_path(payload, ["data", "name"]))
print(get_key_path(payload, ["data", "address", "postcode"]))
Output:
$ python3 ~/tmp/so.py
{'name': 'John', 'surname': 'Doe', 'address': {'postcode': '667'}}
John
667
You can always later decide on a separator character and use a single string instead of path, however, you need to make sure this character does not appear in a valid key. For example, using |, the only change you need to do in get_key_path is:
def get_key_path(d, path):
    obj = d
    for key in path.split("|"): # Here
        ...
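As a compact, complete variant of the separator-based lookup, the loop can be expressed with functools.reduce. This is a sketch rather than the answer's exact code: the sep parameter is an addition, and a missing key surfaces as the plain KeyError raised by the lookup itself.

```python
from functools import reduce
import operator

def get_key_path(d, path, sep="|"):
    # operator.getitem(obj, key) is obj[key]; reduce applies it once per
    # path component, carrying the intermediate object along.
    return reduce(operator.getitem, path.split(sep), d)

payload = {
    "data": {
        "name": "John",
        "surname": "Doe",
        "address": {"postcode": "667"},
    }
}

print(get_key_path(payload, "data|name"))
print(get_key_path(payload, "data|address|postcode"))
```

The same caveat applies as before: the separator must not occur inside any real key.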
There isn't really a way you can do this by using the 'search string'. You can use the get() method, but like getting it using the square brackets, you will have to first parse the dictionary inside the data key.
You could try creating your own function that uses something like:
str.split(sep=None, maxsplit=-1)
Return a list of the words in the string, using sep as the delimiter string. If maxsplit is given, at most maxsplit splits are done (thus, the list will have at most maxsplit+1 elements). If maxsplit is not specified or -1, then there is no limit on the number of splits (all possible splits are made).
def get_leaf_value(d, search_string):
    if ":" not in search_string:
        return d[search_string]
    next_d, next_search_string = search_string.split(':', 1)
    # recurse into the nested dictionary with the rest of the search string
    return get_leaf_value(d[next_d], next_search_string)
payload = {
"data": {
"name": "John",
"surname": "Doe"
}
}
print(payload["data"]["name"])
print(get_leaf_value(payload, "data:name"))
Output:
John
John
This approach will only work if your data is completely nested dictionaries like in your example (i.e., no lists in non-leaf nodes) and : is not part of any keys obviously.
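If lists can appear along the path, one possible extension is to treat a path component as an integer index whenever the current object is a list. The sketch below is an assumption-laden variant, not part of the answer above; the "phones" data is invented for the example.

```python
def get_leaf_value(d, search_string, sep=":"):
    obj = d
    for part in search_string.split(sep):
        if isinstance(obj, list):
            # Inside a list the component must be an integer index.
            obj = obj[int(part)]
        else:
            obj = obj[part]
    return obj

# Hypothetical payload with a list in a non-leaf position.
payload = {
    "data": {
        "name": "John",
        "phones": [
            {"type": "home", "number": "123123"},
            {"type": "work", "number": "456456"},
        ],
    }
}

print(get_leaf_value(payload, "data:name"))
print(get_leaf_value(payload, "data:phones:1:number"))
```

The restriction on the separator still holds: ":" must not be part of any key.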
Here is an alternative. Maybe an overkill, it depends.
jq uses a single "search string" - an expression called 'jq program' by the author - to extract and transform data. It is a powerful tool meaning the jq program can be quite complex. Reading a good tutorial is almost a must.
import pyjq
payload = ... as posted in the question ...
expr = '.data.name'
name = pyjq.one(expr, payload) # "John"
The original project (written in C) is located here. The Python jq libraries are built on top of that C code.

How to use f-string formatting around a string of a dictionary

The following code causes an Invalid Format Specifier when the string is converted to an f-string. I can't pinpoint what the problem is as my quotations look okay.
expected_document = f'{"name":"tenders","value":"chicken","key":"{key}"}'
causes:
> expected_document = f'{"name":"tenders","value":"chicken","key":"{key}"}'
E ValueError: Invalid format specifier
while removing the f:
expected_document = '{"name":"tenders","value":"chicken","key":"{key}"}'
works fine.
Why use f-string at all?
The following code works.
key = 'test'
expected_document = { "name": "tenders", "value": "chicken", "key": key }
print(expected_document)
Output:
{'name': 'tenders', 'value': 'chicken', 'key': 'test'}
Update #1: If you want it as a string and don't want to do type conversion, then...
key = 'test'
expected_document_1 = '{"name":"tenders","value":"chicken","key":"' + key + '"}' # old fashioned way
print(expected_document_1)
expected_document_2 = f'{{"name":"tenders","value":"chicken","key":"{key}"}}' # using f-string
print(expected_document_2)
Output:
{"name":"tenders","value":"chicken","key":"test"}
{"name":"tenders","value":"chicken","key":"test"}
Update #2: #Error - Syntactical Remorse had already suggested the second option of escaping the braces in one of the comments.
I think you can just put the f inside the dict like this:
key = 'test'
expected_document = {"name":"tenders","value":"chicken","key":f"{key}"}
You could compile the dictionary without using interpolation and then convert it to a string.
temporary_variable = {"name": "tenders", "value": "chicken", "key": key}
expected_document = str(temporary_variable)
Or you could even put this in one line.
expected_document = str({"name": "tenders", "value": "chicken", "key": key})
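I am using Python 3.6.3; I don't know how other versions would handle this. A potential drawback is that before Python 3.7 you shouldn't rely on dictionaries to maintain their order, which could cause this to break. Note also that str() produces the dictionary's repr, with single-quoted strings, so the result is not the same text as the double-quoted document in the question.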
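If the goal is a JSON document as a string, json.dumps may be a better fit than str() or manual brace escaping. A small sketch, assuming the compact, space-free form from the question is wanted (hence the separators argument):

```python
import json

key = "test"
document = {"name": "tenders", "value": "chicken", "key": key}

# separators=(",", ":") drops the spaces json.dumps inserts by default.
expected_document = json.dumps(document, separators=(",", ":"))
print(expected_document)

# str() gives Python's repr instead, with single quotes.
print(str(document))
```

json.dumps also escapes quotes and other special characters inside key, which neither the f-string nor the concatenation approach does.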

Get the parent key by matching the value using Regular Expression

Consider the below json object, Here I need to take the parent key by matching the value using regular expression.
{
"PRODUCT": {
"attribs": {
"U1": {
"name": "^U.*1$"
},
"U2": {
"name": "^U.*2$"
},
"U3": {
"name": "^U.*3$"
},
"U4": {
"name": "^U.*4$"
},
"U5": {
"name": "^U.*5$"
},
"P1": {
"name": "^P.*1$"
}
}
}
}
I will be passing a String like this "U10001", It should return the key(U1) by matching the regular expression(^U.*1$).
If I am passing a String like this "P200001", It should return the key(P1) by matching the regular expression(^P.*1$).
I am looking for some help regarding the same, Any help is appreciated.
I'm not sure how you are getting your JSON, but you added python as a tag, so I'm assuming at some point you will have it stored as a string in your code.
First decode the string into a python dict.
import json
my_dict = json.loads(my_json)["PRODUCT"]["attribs"]
If the JSON is formatted as above you should get a dict with keys as your U1, U2, etc.
Now you can use filter in python to apply your regular expression logic, and re to do the actual matching.
import re
test_string = "U10001"
def re_filter(item):
    # item is a (key, value) pair; value["name"] holds the regular expression
    return re.match(item[1]["name"], test_string)
result = filter(re_filter, my_dict.items())
# Just get the matching attribute names
print([i[0] for i in result])
I haven't run the code, so it might need some syntax fixing, but this should give you the general idea. Of course, you will need to make it more generic to allow multiple products.
How about this:
import re
my_dict = {...}
def get_key(dict_, test):
    return next(k for k, v in dict_.items() if re.match(v['name'], test))
test = "U10001"
result = get_key(my_dict['PRODUCT']['attribs'], test)
print(result) # U1
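One detail worth handling in the next-based lookup: next raises StopIteration when nothing matches. Passing a default avoids that. The sketch below uses a shortened copy of the attribs dictionary from the question; the default parameter is an addition for illustration.

```python
import re

# Shortened copy of the "attribs" mapping from the question.
attribs = {
    "U1": {"name": "^U.*1$"},
    "U2": {"name": "^U.*2$"},
    "P1": {"name": "^P.*1$"},
}

def get_key(attribs, test, default=None):
    # The second argument to next() is returned when the generator is
    # exhausted, i.e. when no pattern matches the test string.
    return next(
        (k for k, v in attribs.items() if re.match(v["name"], test)),
        default,
    )

print(get_key(attribs, "U10001"))
print(get_key(attribs, "P200001"))
print(get_key(attribs, "X999"))
```

If several patterns can match the same string, this returns only the first one in dictionary order.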
Can you please elaborate on what you exactly want to design? Here's a quick way to return the desired key.
import re
def getKey(string):
    # raw string so that \d is not treated as a string escape
    return re.search(r'^(.\d)\d+', string).group(1)
If you want to loop over the whole JSON, load it into a dictionary and then loop over the "PRODUCT" -> "attribs" dictionary to get the required key:
import json, re
with open('../file/path/here') as f:
    d = json.load(f)
patents = d['PRODUCT']['attribs']
string = "U10001"
for key, val in patents.items():
    # val['name'] holds the pattern; the key is the group we are after
    if re.match(val['name'], string):
        patent_group = key  # U1, U2, U3, ... or P1, P2, P3, ...
        # do whatever with patent_group (U1/P1 etc.)

Convert a string with the name of a variable to a dictionary

I have a string which is a little complex in that it has some objects embedded as values. I need to convert it to a proper dict or JSON.
foo = '{"Source": "my source", "Input": {"User_id": 18, "some_object": this_is_a_variable_actually}}'
Notice that the last key some_object has a value which is neither a string nor an int. Hence when I do a json.loads or ast.literal_eval, I am getting malformed string errors, and so Converting a String to Dictionary doesn't work.
I have no control over the source of the string.
Is it possible to convert such strings to a dictionary?
The result I need is a dict like this:
dict = {
"Source" : "Good",
"object1": variable1,
"object2": variable2
}
The thing here is I wouldn't know what is variable1 or 2. There is no pattern here.
One point I want to mention here is that, If I can make the variables as just plain strings, that is also fine
For example,
dict = {
"Source" : "Good",
"object1": "variable1",
"object2": "variable2"
}
This will be good for my purpose. Thanks for all the answers.
It's a bit of a kludge, using the demjson module, which allows you to parse most of a somewhat non-conforming JSON syntax string and lists the errors. You can then use that to replace the invalid tokens found and put quotes around them just so the string parses correctly, e.g.:
import demjson
import re
foo = '{"Source": "my source", "Input": {"User_id": 18, "some_object": this_is_a_variable_actually}}'
def f(json_str):
    res = demjson.decode(json_str, strict=False, return_errors=True)
    if not res.errors:
        return res
    for err in res.errors:
        var = err.args[1]
        json_str = re.sub(r'\b{}\b'.format(var), '"{}"'.format(var), json_str)
    return demjson.decode(json_str, strict=False)
res = f(foo)
Gives you:
{'Input': {'User_id': 18, 'some_object': 'this_is_a_variable_actually'}, 'Source': 'my source'}
Note that while this should work on the example data presented, your mileage may vary if there are other nuances in your input that require further munging.
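If installing demjson is not an option, a regex-only approach using just the standard library might be enough for simple inputs. This is a rough sketch under strong assumptions: bare names only ever appear as values (after a colon), they look like Python identifiers, and no string value contains text that resembles `: name,`. The helper name quote_bare_names is made up.

```python
import json
import re

foo = '{"Source": "my source", "Input": {"User_id": 18, "some_object": this_is_a_variable_actually}}'

JSON_LITERALS = {"true", "false", "null"}

def quote_bare_names(json_str):
    def repl(match):
        name = match.group(2)
        if name in JSON_LITERALS:
            # Leave real JSON literals untouched.
            return match.group(0)
        return '{}"{}"'.format(match.group(1), name)

    # A colon, optional whitespace, then an identifier that is followed
    # (after optional whitespace) by ',', '}' or ']'.
    return re.sub(r'(:\s*)([A-Za-z_]\w*)(?=\s*[,}\]])', repl, json_str)

result = json.loads(quote_bare_names(foo))
print(result)
```

This turns the unknown variable into a plain string, which the question says is acceptable, but it is far less robust than a tolerant parser.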

Create (json-) array from browser query string

From the geolocation API browser query, I get this:
browser=opera&sensor=true&wifi=mac:B0-48-7A-99-BD-86|ss:-72|ssid:Baldur WLAN|age:4033|chan:6&wifi=mac:00-24-FE-A7-BA-94|ss:-83|ssid:wlan23-k!17|age:4033|chan:10&wifi=mac:90-F6-52-3F-60-64|ss:-95|ssid:Baldur WLAN|age:4033|chan:13&device=mcc:262|mnc:7|rt:3&cell=id:15479311|lac:21905|mcc:262|mnc:7|ss:-107|ta:0&location=lat:52.398529|lng:13.107570
I would like to access each individual value locally in a structured form. My approach is to build a JSON structure that goes deeper than just splitting by "&" first and "=" afterwards to get an array of all values in the query. Another approach, using the regex (\w+)=(.*) after splitting by "&", ends at the same depth, but I need the nested details to be accessible with proper data types.
The resulting array should look like:
{
    "browser": ["opera"],
    ...
    "location": [{
        "lat": 52.398529,
        "lng": 13.107570
    }],
    ...
    "wifi": [{
        "mac": "00-24-FE-A7-BA-94",
        "ss": -83,
        ...
    },
    {
        "mac": "00-24-FE-A7-BA-94",
        "ss": -83,
        ...
    }]
}
Or something similar that I can parse with an additional json library to access the values using python. Can anyone help with this?
Here is a solution that goes through a dictionary:
import re
import json
# transform a string to a dictionary; sepfield is the field separator
def str_to_dict(s, sepfield, sepkv, infields=None):
    """ transform a string to a dictionary
    s: the string to transform
    sepfield: the string with the field separator char
    sepkv: the string with the key value separator
    infields: a function to be applied to the values
    if infields is defined a list of elements with common keys returned
    for each key, otherwise the value is associated to the key as it is"""
    pattern = "([^%s%s]*?)%s([^%s]*)" % (sepkv, sepfield, sepkv, sepfield)
    matches = re.findall(pattern, s)
    if infields is None:
        return dict(matches)
    else:
        r = dict()
        for k, v in matches:
            parsedval = infields(v)
            if k not in r:
                r[k] = []
            r[k].append(parsedval)
        return r

def second_level_parsing(x):
    # values without "|" are kept as they are; the others become a dict
    return x if x.find("|") == -1 else str_to_dict(x, "|", ":")

# s is the query string shown in the question
json.dumps(str_to_dict(s, "&", "=", second_level_parsing))
You can easily extend for multiple levels. Note that the different behaviour whether the infields function is defined or not is to match the output you asked for.
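As an alternative sketch, the first level can be split with the standard library's urllib.parse.parse_qs, which already groups repeated names such as wifi into lists. The helper names convert, parse_value and parse_query are made up for this example, the query is a shortened version of the one in the question, and the numeric conversion (int, then float, else string) is an assumption about the desired data types.

```python
from urllib.parse import parse_qs

def convert(value):
    # Try int first, then float; otherwise keep the string.
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value

def parse_value(value):
    if ":" not in value:
        return convert(value)
    fields = {}
    for part in value.split("|"):
        # partition splits at the first ":" only.
        key, _, val = part.partition(":")
        fields[key] = convert(val)
    return fields

def parse_query(query):
    return {
        name: [parse_value(v) for v in values]
        for name, values in parse_qs(query).items()
    }

# Shortened version of the query string from the question.
query = (
    "browser=opera&sensor=true"
    "&wifi=mac:B0-48-7A-99-BD-86|ss:-72|ssid:Baldur WLAN|chan:6"
    "&wifi=mac:00-24-FE-A7-BA-94|ss:-83"
    "&location=lat:52.398529|lng:13.107570"
)
result = parse_query(query)
print(result["wifi"][0]["ss"], result["location"][0]["lat"])
```

Note that parse_qs also decodes percent-escapes and turns "+" into a space, which is usually what you want for a real query string. The result can be passed to json.dumps as in the answer above.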
