I need to parse a JSON file which, unfortunately for me, does not follow the standard format. I have two issues with the data. I've already found a workaround for the second one, so I'll just mention it at the end; maybe someone can help there as well.
So I need to parse entries like this:
"Test":{
    "entry":{
        "Type":"Something"
    },
    "entry":{
        "Type":"Something_Else"
    }
}, ...
The default json parser updates the dictionary and therefore keeps only the last entry. I HAVE to somehow store the other one as well, and I have no idea how to do this. I also HAVE to store the keys of the various dictionaries in the same order they appear in the file, which is why I am using an OrderedDict. It works fine, so if there is any way to extend this to handle the duplicate entries, I'd be grateful.
My second issue is that this very same JSON file contains entries like this:
"Test":{
    {
        "Type":"Something"
    }
}
The json.load() function raises an exception when it reaches that line in the JSON file. The only workaround I found was to manually remove the inner brackets myself.
Thanks in advance
You can use JSONDecoder.object_pairs_hook to customize how JSONDecoder decodes objects. This hook function will be passed a list of (key, value) pairs that you usually do some processing on, and then turn into a dict.
However, since Python dictionaries don't allow for duplicate keys (and you simply can't change that), you can return the pairs unchanged in the hook and get a nested list of (key, value) pairs when you decode your JSON:
from json import JSONDecoder

def parse_object_pairs(pairs):
    # Return the (key, value) pairs untouched instead of building a dict.
    return pairs

data = """
{"foo": {"baz": 42}, "foo": 7}
"""

decoder = JSONDecoder(object_pairs_hook=parse_object_pairs)
obj = decoder.decode(data)
print(obj)
Output:
[(u'foo', [(u'baz', 42)]), (u'foo', 7)]
How you use this data structure is up to you. As stated above, Python dictionaries won't allow for duplicate keys, and there's no way around that. How would you even do a lookup based on a key? dct[key] would be ambiguous.
So you can either implement your own logic to handle a lookup the way you expect it to work, or implement some sort of collision avoidance to make keys unique if they're not, and then create a dictionary from your nested list.
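As a rough sketch of the first option (your own lookup logic), a small helper can collect every value stored under a given key in the list of pairs. The name get_all is made up for this illustration, and the snippet assumes Python 3 (so the keys print without the u prefix):

```python
from json import JSONDecoder

def get_all(pairs, key):
    """Return every value stored under `key` in a list of (key, value) pairs."""
    return [value for k, value in pairs if k == key]

# The hook returns the pairs unchanged, as in the example above.
decoder = JSONDecoder(object_pairs_hook=lambda pairs: pairs)
pairs = decoder.decode('{"foo": {"baz": 42}, "foo": 7}')

foo_values = get_all(pairs, "foo")
print(foo_values)  # [[('baz', 42)], 7]
```

The lookup is a linear scan, so this only makes sense for small objects or occasional lookups.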
Edit: Since you said you would like to modify the duplicate key to make it unique, here's how you'd do that:
from collections import OrderedDict
from json import JSONDecoder

def make_unique(key, dct):
    # Append _1, _2, ... until the key no longer collides with one in dct.
    counter = 0
    unique_key = key
    while unique_key in dct:
        counter += 1
        unique_key = '{}_{}'.format(key, counter)
    return unique_key

def parse_object_pairs(pairs):
    dct = OrderedDict()
    for key, value in pairs:
        if key in dct:
            key = make_unique(key, dct)
        dct[key] = value
    return dct

data = """
{"foo": {"baz": 42, "baz": 77}, "foo": 7, "foo": 23}
"""

decoder = JSONDecoder(object_pairs_hook=parse_object_pairs)
obj = decoder.decode(data)
print(obj)
Output:
OrderedDict([(u'foo', OrderedDict([(u'baz', 42), ('baz_1', 77)])), ('foo_1', 7), ('foo_2', 23)])
The make_unique function is responsible for returning a collision-free key. In this example it just suffixes the key with _n where n is an incremental counter - just adapt it to your needs.
Because the object_pairs_hook receives the pairs exactly in the order they appear in the JSON document, it's also possible to preserve that order by using an OrderedDict, so I included that as well.
Thanks a lot @Lukas Graf, I got it working as well by implementing my own version of the hook function:
import collections

def dict_raise_on_duplicates(ordered_pairs):
    # Despite the name, this renames duplicate keys instead of raising.
    count=0
    d=collections.OrderedDict()
    for k,v in ordered_pairs:
        if k in d:
            d[k+'_dupl_'+str(count)]=v
            count+=1
        else:
            d[k]=v
    return d
The only thing remaining is to automatically get rid of the double brackets, and then I am done :D Thanks again
If you would prefer to convert those duplicated keys into an array, instead of having separate copies, this could do the work:
def dict_raise_on_duplicates(ordered_pairs):
    """Convert duplicate keys to JSON array."""
    d = {}
    for k, v in ordered_pairs:
        if k in d:
            if type(d[k]) is list:
                d[k].append(v)
            else:
                d[k] = [d[k],v]
        else:
            d[k] = v
    return d
And then you just use:
data = json.loads(yourString, object_pairs_hook=dict_raise_on_duplicates)
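If key order also matters, as it does in the question, the same idea can be combined with an OrderedDict. The following is only a sketch: collect_duplicates and the repeated set are names invented here. The set records which keys have already been turned into lists, so a value that was a JSON array to begin with is not mistaken for a list of duplicates.

```python
import json
from collections import OrderedDict

def collect_duplicates(ordered_pairs):
    """Keep key order; gather the values of a repeated key into one list."""
    result = OrderedDict()
    repeated = set()  # keys whose value has already been turned into a list
    for key, value in ordered_pairs:
        if key not in result:
            result[key] = value
        elif key in repeated:
            result[key].append(value)
        else:
            result[key] = [result[key], value]
            repeated.add(key)
    return result

text = '{"Test": {"entry": {"Type": "Something"}, "entry": {"Type": "Something_Else"}}}'
parsed = json.loads(text, object_pairs_hook=collect_duplicates)
types = [entry["Type"] for entry in parsed["Test"]["entry"]]
print(types)  # ['Something', 'Something_Else']
```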
Due to some poor planning, I have a script that expects a Python dict with certain keys; however, the other script that creates this dict uses a different naming convention.
Unfortunately, due to translations that have already taken place it looks like I'll need to convert the dict keys.
Basically go from
{'oldKey':'data'}
to
{'newKey':'data'}
I was thinking of creating a dict:
{'oldKey':'newKey'}
and iterating through the dict to convert from oldKey to newKey. However, is this the most efficient/pythonic way to do it?
I can think of a couple of ways to do this which use dictionaries, but one of them might be more efficient depending on the coverage of the key usage.
a) With a dictionary comprehension:
old_dict = {'oldkey1': 'val1', 'oldkey2': 'val2',
            'oldkey3': 'val3', 'oldkey4': 'val4',
            'oldkey5': 'val5'}
key_map = {'oldkey1': 'newkey1', 'oldkey2': 'newkey2',
           'oldkey3': 'newkey3', 'oldkey4': 'newkey4',
           'oldkey5': 'newkey5'}
# items() works on both Python 2 and 3 (iteritems() is Python 2 only).
new_dict = {newkey: old_dict[oldkey] for (oldkey, newkey) in key_map.items()}
print(new_dict['newkey1'])
b) With a simple class that does the mapping. (Note that I have switched the order of key/value in key_map in this example.) This might be more efficient because it will use lazy evaluation - no need to iterate through all the keys - which may save time if not all the keys are used.
class DictMap(object):
    def __init__(self, key_map, old_dict):
        self.key_map = key_map
        self.old_dict = old_dict

    def __getitem__(self, key):
        # Translate the new key to the old one only when it is looked up.
        return self.old_dict[self.key_map[key]]

key_map = {'newkey1': 'oldkey1',
           'newkey2': 'oldkey2',
           'newkey3': 'oldkey3',
           'newkey4': 'oldkey4',
           'newkey5': 'oldkey5'}
new_dict2 = DictMap(key_map, old_dict)
print(new_dict2['newkey1'])
This will solve your problem:
new_dict={key_map[oldkey]: vals for (oldkey, vals) in old_dict.items()}
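Note that the one-liner above raises a KeyError if old_dict contains a key that key_map does not mention. A possible variant, assuming such keys should simply be kept under their old name, falls back to the original key with dict.get. The sample data here is made up for illustration:

```python
old_dict = {'oldkey1': 'val1', 'oldkey2': 'val2', 'untouched': 'val3'}
key_map = {'oldkey1': 'newkey1', 'oldkey2': 'newkey2'}

# Fall back to the original key when key_map has no entry for it.
new_dict = {key_map.get(key, key): value for key, value in old_dict.items()}
print(new_dict)
```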
I am attempting to wrap an API with the following function. The API has end points that look similar to this:
/users/{ids}
/users/{ids}/permissions
The idea is that I'll be able to pass a dictionary to my function that contains a list of ids and those will be formatted as the API expects:
users = {'ids': [1, 2, 3, 5]}
call_api('/users/{ids}/permissions', users)
Then in call_api, I currently do something like this
def call_api(url, data):
    for k, value in data.items():
        if "{" + k + "}" in url:
            url = url.replace("{"+k+"}", ';'.join(str(x) for x in value))
            data.pop(k, None)
This works, but I can't imagine that if statement is efficient.
How can I improve it and have it work in both Python 2.7 and Python 3.5?
I've also been told that changing the dictionary while iterating is bad, but in my tests I've never had an issue. I am popping the value because I later check whether there are unexpected parameters (i.e. anything left in data). Is what I'm doing now the right way?
Instead of modifying a dictionary as you iterate over it, creating another object to hold the unused keys is probably the way to go. In Python 3.4+, at least, removing keys during iteration will raise a RuntimeError: dictionary changed size during iteration.
def call_api(url, data):
    unused_keys = set()
    for k, value in data.items():
        key_pattern = "{" + k + "}"
        if key_pattern in url:
            formatted_value = ';'.join(map(str, value))
            url = url.replace(key_pattern, formatted_value)
        else:
            unused_keys.add(k)
Also, if you think that you're more likely to run into an unused key, reversing the conditions might be the way to go.
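For reference, the RuntimeError mentioned above can be reproduced with a few lines. This is a minimal sketch with made-up data, run on Python 3. (On Python 2, data.items() returns a list, which is probably why the original code never failed in testing there.)

```python
data = {'ids': [1, 2, 3], 'extra': 'x'}

try:
    for key in data:      # iterating over the live dict
        data.pop(key)     # shrinks the dict mid-iteration
except RuntimeError as exc:
    error_message = str(exc)
else:
    error_message = None

# Iterating over a snapshot of the keys is safe on Python 2 and 3.
data = {'ids': [1, 2, 3], 'extra': 'x'}
for key in list(data):
    data.pop(key)

print(error_message)
print(data)
```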
Here is one way to do it. First, the URL string is parsed for its keys. The function then works out which keys of the dict are not used in the URL and sets them aside. Lastly, it formats the URL with the values from the dict. The function returns the unused keys and the formatted URL. If you wish, you can remove the unused keys from the dict by iterating over them and deleting each one.
Here's some documentation with examples regarding the format syntax.
import string

users = {'ids': [1, 2, 3, 5]}

def call_api(url, data):
    data_set = set(data)
    formatter = string.Formatter()
    used_set = {f[1] for f in formatter.parse(url) if f[1] is not None}
    unused_set = data_set - used_set
    formatted = url.format(**{k: ";".join(str(x) for x in v)
                              for k, v in data.items()})
    return unused_set, formatted

print(call_api('/users/{ids}/permissions', users))
You could use re.subn which returns the number of replacements made:
import re

def call_api(url, data):
    for k, value in list(data.items()):
        # re.escape guards against keys that contain regex metacharacters.
        url, n = re.subn(r'\{%s\}' % re.escape(k), ';'.join(str(x) for x in value), url)
        if n:
            del data[k]
Note that for compatibility with both Python 2 and Python 3, it is also necessary to iterate over a copy of the list of items when destructively iterating over the dict.
EDIT:
It seems the main bottleneck is checking that the key is in the url. The in operator is easily the most efficient way to do this, and is much faster than a regex for the simple pattern that is being used here. Recording the unused keys separately is also more efficient than destructive iteration, but it doesn't make as much difference (relatively speaking).
So: there's not much wrong with the original solution, but the one given by @wegry is the most efficient.
The formatting keys can be found with a regex and then compared to the keys in the dictionary. Your string is already set up to use str.format, so you can apply a transformation to the values in data and then format the URL with the transformed values.
import re
from toolz import valmap

def call_api(url, data):
    unused = set(data) - set(re.findall(r'\{(\w+)\}', url))
    # str.format_map is only available on Python 3.
    url = url.format_map(valmap(lambda v: ';'.join(map(str, v)), data))
    return url, unused
The usage looks like:
users = {'ids': [1, 2, 3, 5], 'unused_key': 'value'}
print(call_api('/users/{ids}/permissions', users))
# ('/users/1;2;3;5/permissions', {'unused_key'})
This isn't going to time that well, but it's concise. As noted in one of the comments, it seems unlikely that this method will be a bottleneck.
I have a dictionary that I create like this:
myDict = {}
Then I'd like to add a key to it that corresponds to another dictionary, in which I put another value:
myDict[2000]['hello'] = 50
So when I pass myDict[2000]['hello'] somewhere, it would give 50.
Why isn't Python just creating those entries right there? What's the issue? I thought KeyError only occurs when you try to read an entry that doesn't exist, but I'm creating it right here?
KeyError occurs because you are trying to read a non-existent key when you try to access myDict[2000]. As an alternative, you could use defaultdict:
>>> from collections import defaultdict
>>> myDict = defaultdict(dict)
>>> myDict[2000]['hello'] = 50
>>> myDict[2000]
{'hello': 50}
defaultdict(dict) means that if myDict encounters an unknown key, it will return a default value, in this case whatever is returned by dict() which is an empty dictionary.
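One side effect worth knowing about, shown here as a small sketch with made-up keys: a defaultdict does not just return the default for a missing key, it also stores it. Membership tests and .get() leave the dictionary untouched.

```python
from collections import defaultdict

my_dict = defaultdict(dict)
my_dict[2000]['hello'] = 50

# Merely reading a missing key inserts it with the default value.
_ = my_dict[1999]
keys_after_read = sorted(my_dict)

# Membership tests and .get() do not trigger the default factory.
has_1998 = 1998 in my_dict
value_1998 = my_dict.get(1998)

print(keys_after_read, has_1998, value_1998)  # [1999, 2000] False None
```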
What you want is to implement a nested dict. I recommend this approach:
class Vividict(dict):
    def __missing__(self, key):
        value = self[key] = type(self)()
        return value
From the docs, under d[key]
New in version 2.5: If a subclass of dict defines a method
__missing__(), if the key key is not present, the d[key]
operation calls that method with the key key as argument
To try it:
myDict = Vividict()
myDict[2000]['hello'] = 50
and myDict now returns:
{2000: {'hello': 50}}
And this will work for any arbitrary depth you want:
myDict['foo']['bar']['baz']['quux']
just works.
But you are trying to read an entry that doesn't exist: myDict[2000].
The exact translation of what you say in your code is "give me the entry in myDict with the key of 2000, and store 50 against the key 'hello' in that entry." But myDict doesn't have a key of 2000, hence the error.
What you actually need to do is to create that key. You can do that in one go:
myDict[2000] = {'hello': 50}
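If the inner dictionary may or may not exist yet, a plain dict's setdefault method is another option: it creates the inner dict on first use and reuses it afterwards. A minimal sketch (the variable name my_dict is just for illustration):

```python
my_dict = {}

# setdefault returns the existing inner dict, or stores and returns a new one.
my_dict.setdefault(2000, {})['hello'] = 50
my_dict.setdefault(2000, {})['world'] = 60

print(my_dict)  # {2000: {'hello': 50, 'world': 60}}
```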
In the scenario below, when you append new_result to the dict, you will get KeyError: 'result':
dict = {}
new_result = {'key1':'new_value1','key2':'new_value'}
dict['result'].append(new_result)
This happens because the key doesn't exist; in other words, your dict doesn't have a 'result' key. I fixed this problem with defaultdict and the setdefault method.
To try it:
from collections import defaultdict
results = defaultdict(list)
new_result = {'key1':'new_value1','key2':'new_value2'}
# setdefault also works on a plain dict; with defaultdict(list),
# results['result'].append(new_result) would work as well.
results.setdefault('result', []).append(new_result)
You're right, but in your code Python first has to get myDict[2000] and then do the assignment. Since that entry doesn't exist, it can't assign to its elements.
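Spelled out as two separate steps, the statement looks roughly like this. The sketch uses a made-up variable name and shows that the KeyError comes from the lookup, before any assignment happens:

```python
my_dict = {}

try:
    inner = my_dict[2000]   # step 1: look up the outer key (this is what fails)
    inner['hello'] = 50     # step 2: assign into the inner dict (never reached)
except KeyError as exc:
    missing_key = exc.args[0]

print(missing_key)  # 2000
print(my_dict)      # {}
```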
I have a JSON file (~3 GB) that I need to load into MongoDB. Quite a few of the JSON keys contain a . (dot), which causes the load into MongoDB to fail. I want to load the JSON file and edit the key names in the process, say replace the dot with an empty space. I am using the following Python code:
import json
from copy import deepcopy

def RemoveDotKey(dataPart):
    for key in dataPart.iterkeys():
        new_key = key.replace(".","")
        if new_key != key:
            newDataPart = deepcopy(dataPart)
            newDataPart[new_key] = newDataPart[key]
            del newDataPart[key]
            return newDataPart
    return dataPart

new_json = json.loads(data, object_hook=RemoveDotKey)
The object_hook called RemoveDotKey should iterate over all the keys and, if a key contains a dot, create a copy, replace the dot with a space, and return the copy. I created a copy of dataPart because I was not sure whether I can iterate over dataPart's keys and insert/delete key-value pairs at the same time.
There seems to be an error here: not all of the JSON keys with a dot in them are getting edited. I am not very sure how json.load works. I am also new to Python (I have been using it for less than a week).
You almost had it:
import json

def remove_dot_key(obj):
    # Iterate over a snapshot of the keys, since the loop renames keys in obj.
    for key in list(obj.keys()):
        new_key = key.replace(".","")
        if new_key != key:
            obj[new_key] = obj[key]
            del obj[key]
    return obj

new_json = json.loads(data, object_hook=remove_dot_key)
You were returning a dictionary inside your loop, so you'd only modify one key. And you don't need to make a copy of the values, just rename the keys.
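Because json calls the object_hook once for every decoded object, innermost first, nested keys are handled without any explicit recursion. Here is a small variant that builds a new dict instead of mutating the one passed in; strip_dots and the sample document are invented for this sketch. If two keys collapse to the same name after the dots are removed, the later one silently wins.

```python
import json

def strip_dots(obj):
    """Return a copy of `obj` whose keys have every '.' removed."""
    return {key.replace(".", ""): value for key, value in obj.items()}

data = '{"a.b": {"c.d": 1, "e": [{"f.g": 2}]}}'
cleaned = json.loads(data, object_hook=strip_dots)
print(cleaned)  # {'ab': {'cd': 1, 'e': [{'fg': 2}]}}
```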
I have a django queryset that returns a list of values:
[(client pk, timestamp, value, task pk), (client pk, timestamp, value, task pk),....,].
I am trying to get it to return a dictionary of this format:
{client pk:[[timestamp, value],[timestamp, value],...,], client pk:[list of lists],...,}
The values_list may have multiple records for each client pk. I have been able to get dictionaries of lists for client or task pk using:
def dict_from_vl(vls_list):
    keys=[vls_list[x][3] for x in range(0,len(vls_list),1)]
    values = [[vls_list[x][1], vls_list[x][2]] for x in range(0,len(vls_list),1)]
    target_dict=dict(zip(keys,values))
    return target_dict
However using this method, values for the same key write over previous values as it iterates through the values_list, rather than append them to a list. So this works great for getting the most recent if the values list is sorted oldest records to newest, but not for the purpose of creating a list of lists for the dict value.
Instead of target_dict=dict(zip(keys,values)), do
target_dict = defaultdict(list)
for i, key in enumerate(keys):
    target_dict[key].append(values[i])
(defaultdict is available in the standard module collections.)
from collections import defaultdict
d = defaultdict(list)
for x in vls_list:
    # Key on the client pk (x[0]); keep only timestamp and value.
    d[x[0]].append(list(x[1:3]))
Although I'm not sure if I got the question right.
I know in Python you're supposed to cram everything into a single line, but you could do it the old fashioned way...
def dict_from_vl(vls_list):
    target_dict = {}
    for v in vls_list:
        if v[0] not in target_dict:
            target_dict[v[0]] = []
        target_dict[v[0]].append([v[1], v[2]])
    return target_dict
For better speed, I suggest you don't create the keys and values lists separately but simply use a single loop:
from collections import defaultdict
tgt_dict = defaultdict(list)
for row in vls_list:
    tgt_dict[row[0]].append([row[1], row[2]])
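To make the expected result concrete, here is a small self-contained run of the single-loop approach. The rows are invented sample data in the (client pk, timestamp, value, task pk) layout from the question, not real queryset output:

```python
from collections import defaultdict

def dict_from_vl(vls_list):
    """Group [timestamp, value] pairs by client pk, preserving row order."""
    grouped = defaultdict(list)
    for client_pk, timestamp, value, _task_pk in vls_list:
        grouped[client_pk].append([timestamp, value])
    return dict(grouped)

rows = [
    (1, '2013-01-01', 10, 7),
    (2, '2013-01-01', 20, 8),
    (1, '2013-01-02', 11, 7),
]
result = dict_from_vl(rows)
print(result)
```

Client 1 ends up with two [timestamp, value] entries and client 2 with one, which is the dictionary-of-lists shape the question asks for.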