I have a JSON file (~3 GB) that I need to load into MongoDB. Quite a few of the JSON keys contain a . (dot), which causes the load into MongoDB to fail. I want to load the JSON file and edit the key names in the process, say by replacing the dot with an empty string. I am using the following Python code:
import json
from copy import deepcopy  # needed for the copy below

def RemoveDotKey(dataPart):
    for key in dataPart.iterkeys():
        new_key = key.replace(".", "")
        if new_key != key:
            newDataPart = deepcopy(dataPart)
            newDataPart[new_key] = newDataPart[key]
            del newDataPart[key]
            return newDataPart
    return dataPart

new_json = json.loads(data, object_hook=RemoveDotKey)
The object_hook called RemoveDotKey should iterate over all the keys; if a key contains a dot, it creates a copy, replaces the dot, and returns the copy. I created a copy of dataPart, since I am not sure if I can iterate over dataPart's keys and insert/delete key-value pairs at the same time.
There seems to be an error here: not all the JSON keys with a dot in them are getting edited. I am not very sure how json.loads works. Also, I am new to Python (been using it for less than a week).
You almost had it:
import json

def remove_dot_key(obj):
    for key in list(obj.keys()):  # copy the keys, since we mutate obj while looping
        new_key = key.replace(".", "")
        if new_key != key:
            obj[new_key] = obj[key]
            del obj[key]
    return obj

new_json = json.loads(data, object_hook=remove_dot_key)
You were returning a dictionary inside your loop, so you'd only modify one key. And you don't need to make a copy of the values, just rename the keys.
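One detail that's easy to miss: json.loads calls the object_hook for every object it decodes, innermost first, so nested keys get renamed too. A quick check with made-up sample data, reusing remove_dot_key from above:

import json

sample = '{"a.b": 1, "nested": {"c.d": {"e.f": 2}}}'
print(json.loads(sample, object_hook=remove_dot_key))
# -> {'ab': 1, 'nested': {'cd': {'ef': 2}}}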
I need to parse a JSON file which, unfortunately for me, does not follow the standard. I have two issues with the data, but I've already found a workaround for the second, so I'll just mention it at the end; maybe someone can help there as well.
So I need to parse entries like this:
"Test":{
"entry":{
"Type":"Something"
},
"entry":{
"Type":"Something_Else"
}
}, ...
The JSON default parser updates the dictionary and therefore keeps only the last entry. I HAVE to somehow store the other one as well, and I have no idea how to do this. I also HAVE to store the keys in the several dictionaries in the same order they appear in the file; that's why I am using an OrderedDict. It works fine, so if there is any way to expand this to the duplicate entries I'd be grateful.
My second issue is that this very same JSON file contains entries like this:
"Test":{
{
"Type":"Something"
}
}
The json.load() function raises an exception when it reaches that line in the JSON file. The only way I could work around this was to manually remove the inner brackets myself.
Thanks in advance
You can use JSONDecoder.object_pairs_hook to customize how JSONDecoder decodes objects. This hook function will be passed a list of (key, value) pairs that you usually do some processing on, and then turn into a dict.
However, since Python dictionaries don't allow for duplicate keys (and you simply can't change that), you can return the pairs unchanged in the hook and get a nested list of (key, value) pairs when you decode your JSON:
from json import JSONDecoder

def parse_object_pairs(pairs):
    return pairs

data = """
{"foo": {"baz": 42}, "foo": 7}
"""

decoder = JSONDecoder(object_pairs_hook=parse_object_pairs)
obj = decoder.decode(data)
print(obj)
Output:
[(u'foo', [(u'baz', 42)]), (u'foo', 7)]
How you use this data structure is up to you. As stated above, Python dictionaries won't allow for duplicate keys, and there's no way around that. How would you even do a lookup based on a key? dct[key] would be ambiguous.
So you can either implement your own logic to handle a lookup the way you expect it to work, or implement some sort of collision avoidance to make keys unique if they're not, and then create a dictionary from your nested list.
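For instance, a lookup over the raw pairs list could collect every value stored under a duplicated key. A small illustrative helper (not part of the json module), applied to obj from above:

def get_values(pairs, key):
    # collect every value stored under `key`, including duplicates
    return [v for k, v in pairs if k == key]

print(get_values(obj, 'foo'))
# [[(u'baz', 42)], 7]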
Edit: Since you said you would like to modify the duplicate key to make it unique, here's how you'd do that:
from collections import OrderedDict
from json import JSONDecoder

def make_unique(key, dct):
    counter = 0
    unique_key = key
    while unique_key in dct:
        counter += 1
        unique_key = '{}_{}'.format(key, counter)
    return unique_key

def parse_object_pairs(pairs):
    dct = OrderedDict()
    for key, value in pairs:
        if key in dct:
            key = make_unique(key, dct)
        dct[key] = value
    return dct

data = """
{"foo": {"baz": 42, "baz": 77}, "foo": 7, "foo": 23}
"""

decoder = JSONDecoder(object_pairs_hook=parse_object_pairs)
obj = decoder.decode(data)
print(obj)
Output:
OrderedDict([(u'foo', OrderedDict([(u'baz', 42), ('baz_1', 77)])), ('foo_1', 7), ('foo_2', 23)])
The make_unique function is responsible for returning a collision-free key. In this example it just suffixes the key with _n where n is an incremental counter - just adapt it to your needs.
Because the object_pairs_hook receives the pairs exactly in the order they appear in the JSON document, it's also possible to preserve that order by using an OrderedDict; I included that as well.
Thanks a lot @Lukas Graf, I got it working as well by implementing my own version of the hook function:

import collections

def dict_raise_on_duplicates(ordered_pairs):
    count = 0
    d = collections.OrderedDict()
    for k, v in ordered_pairs:
        if k in d:
            d[k + '_dupl_' + str(count)] = v
            count += 1
        else:
            d[k] = v
    return d
The only thing remaining is to automatically get rid of the double brackets and I am done :D Thanks again
If you would prefer to convert those duplicated keys into an array, instead of having separate copies, this could do the work:
def dict_raise_on_duplicates(ordered_pairs):
    """Convert duplicate keys to JSON array."""
    d = {}
    for k, v in ordered_pairs:
        if k in d:
            if isinstance(d[k], list):
                d[k].append(v)
            else:
                d[k] = [d[k], v]
        else:
            d[k] = v
    return d
And then you just use:
result = json.loads(yourString, object_pairs_hook=dict_raise_on_duplicates)
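For example, with a made-up input string, duplicated keys collapse into a list while unique keys stay scalar (despite its name, this version of dict_raise_on_duplicates merges rather than raises):

import json

s = '{"a": 1, "a": 2, "a": 3, "b": 4}'
print(json.loads(s, object_pairs_hook=dict_raise_on_duplicates))
# {'a': [1, 2, 3], 'b': 4}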
I have two text files named weburl.txt and imageurl.txt; weburl.txt contains website URLs and imageurl.txt contains image URLs. I want to create a dictionary that reads a line of weburl.txt to make a dictionary key, with the corresponding imageurl.txt line as the value.
weburl.txt
url1
url2
url3
url4
url5......
imageurl.txt
imgurl1
imgurl2
imgurl3
imgurl4
imgurl5
required output is
{'url1': imgurl1, 'url2': imgurl2, 'url3': imgurl3......}
I am using this code
with open('weburl.txt') as f:
    key = f.readlines()
with open('imageurl.txt') as g:
    value = g.readlines()
dict[key] = [value]
print dict
I am not getting the required results
you can write something like
with open('weburl.txt') as f, \
     open('imageurl.txt') as g:
    # we use `str.strip` method
    # to remove newline characters
    keys = (line.strip() for line in f)
    values = (line.strip() for line in g)
    result = dict(zip(keys, values))
    print(result)
more info about zip at docs
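To make the mechanics concrete, here's roughly what happens with the sample data from the question (values are illustrative, standing in for the stripped file lines):

keys = ['url1', 'url2', 'url3']
values = ['imgurl1', 'imgurl2', 'imgurl3']
result = dict(zip(keys, values))  # zip pairs the items up positionally
print(result)
# {'url1': 'imgurl1', 'url2': 'imgurl2', 'url3': 'imgurl3'}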
There are problems with the statement dict[key] = [value] on so many levels that I get a kind of vertigo as we drill down through them:
The apparent intention to use a variable called dict (a bad idea because it would overshadow Python's builtin reference to the dict class). Let's call it d instead.
Not initializing the dictionary instance first. If you had called it something like d, this oversight would earn you an easy-to-understand NameError. However, since you're calling it dict, Python will actually be attempting to set items in the dict class itself (which doesn't support __setitem__) instead of inside a dict instance, so you'll get a different, more-confusing error.
Attempting to make a dict entry assignment where the key is a non-hashable type (key is a list). You could convert the list to the hashable type tuple easily enough, but that's not what you want because you'd still be...
Attempting to assign a bunch of values to their respective keys all at once. This can't be done with d[key] = value syntax. It could be done all in one relatively simple statement, i.e. d = dict(zip(key, value)), but unfortunately that doesn't get around the fact that you're...
Not stripping the newline character off the end of each key and value.
Instead, this line:
d = dict((k.strip(), v.strip()) for k, v in zip(key, value))
will do what you appear to want.
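Putting it together with the file reads from the question (same file names), a corrected version might look like:

with open('weburl.txt') as f:
    key = f.readlines()
with open('imageurl.txt') as g:
    value = g.readlines()
d = dict((k.strip(), v.strip()) for k, v in zip(key, value))
print(d)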
I am new to Python and working on revising some existing code.
There is a JSON string coming into a Python function that looks like this:
{"criteria": {"modelName":"='ALL'", "modelName": "='NEW'","fields":"*"}}
Right now it appears a dictionary is being used to create a string:
crit = data['criteria']
for crit_key in crit:
    crit_val = crit[crit_key]
    sql += ' and ' + crit_key + crit_val
When the sql string is printed, only the last 'modelName' appears. It seems like a dictionary is being used, and since modelName is a key, the second modelName overwrites the first? I want the sql string in the end to contain both modelNames.
edited because of OP comments
Well, if you can't update your JSON, you have to deal with it.
You can do something like:
data = '{"criteria": {"modelName":"=\'ALL\'", "modelName": "=\'NEW\'","fields":"*"}}'
import json
def dict_raise_on_duplicates(ordered_pairs):
d = {}
duplicated = []
for k, v in ordered_pairs:
if k in d:
if k not in duplicated:
duplicated.append(k)
d[k] = [d[k]] + [v]
else:
d[k].append(v)
else:
d[k] = v
return d
print json.loads(data, object_pairs_hook=dict_raise_on_duplicates)
In this example, data is the JSON string with duplicated keys.
Following json.loads allows duplicate keys in a dictionary, overwriting the first value, I just force json.loads to handle duplicate keys.
If a duplicated key is spotted, it creates a new list containing the current key's data plus the new value.
After that, it only appends new values to the created list.
Output:
{u'criteria': {u'fields': u'*', u'modelName': [u"='ALL'", u"='NEW'"]}}
You will have to update your code anyway, but now you can handle it.
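From there, a sketch of how the SQL-building loop from the question could consume this structure (crit_key/crit_val and the clause format are taken from the question; the base query is omitted):

parsed = json.loads(data, object_pairs_hook=dict_raise_on_duplicates)
crit = parsed['criteria']
sql = ''
for crit_key, crit_val in crit.items():
    if isinstance(crit_val, list):  # a duplicated key: add one clause per value
        for v in crit_val:
            sql += ' and ' + crit_key + v
    else:
        sql += ' and ' + crit_key + crit_val
# sql now ends up containing both modelName clauses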
I need to recursively walk through JSON files (post responses from an API), extracting the strings that have ["text"] as a key: {"text":"this is a string"}
I need to start parsing from the source that has the oldest date in its metadata, extract the strings from that source, then move to the 2nd-oldest source, and so on. The JSON files can be badly nested, and the level where the strings are can change from time to time.
Problem:
There are many keys called ["text"] and I don't need all of them; I need ONLY the ones whose values are strings. Better: the "text":"string" pairs I need are ALWAYS in the same object {} as a "type":"sentence" pair.
What I am asking
Modify the 2nd code below in order to recursively walk the file and extract ONLY the ["text"] values when they are in the same object {} together with "type":"sentence".
A sample of the JSON file is linked below; it contains both the text I need (along with the metadata) and text I don't need to extract:
Link to full JSON sample: http://pastebin.com/0NS5BiDk
What I have done so far:
1) The easy way: transform the JSON file into a string and search for content between the double quotes (""), because in all JSON post responses the "strings" I need are the only ones that come between double quotes. However, this option prevents me from ordering the resources first, therefore it is not good enough.
r1 = s.post(url2, data=payload1)
j = str(r1.json())
sentences_list = re.findall(r'\"(.+?)\"', j)
numentries = 0
for sentences in sentences_list:
    numentries += 1
    print(sentences)
print(numentries)
2) The smarter way: recursively walk through a JSON file and extract the ["text"] values
def get_all(myjson, key):
    if type(myjson) is dict:
        for jsonkey in myjson:
            if type(myjson[jsonkey]) in (list, dict):
                get_all(myjson[jsonkey], key)
            elif jsonkey == key:
                print(myjson[jsonkey])
    elif type(myjson) is list:
        for item in myjson:
            if type(item) in (list, dict):
                get_all(item, key)

print(get_all(r1.json(), "text"))
It extracts all the values that have ["text"] as a key. Unfortunately there is other stuff in the file (that I don't need) that also has ["text"] as a key, so it returns text that I don't need.
Please advise.
UPDATE
I have written two pieces of code to sort the list of objects by a certain key. The 1st one sorts by the 'text' of the XML. The 2nd one by the 'Comprising period from' value.
The 1st one works, but a few of the XMLs, even if they are higher in number, actually contain documents older than I expected.
For the 2nd code the format of 'Comprising period from' is not consistent and sometimes the value is not present at all. The second one also gives me an error, but I cannot figure out why: string indices must be integers.
# 1st code (it works but not ideal)
j = r1.json()
list = []
for row in j["tree"]["children"][0]["children"]:
    list.append(row)
newlist = sorted(list, key=lambda k: k['text'][-9:])
print(newlist)
# 2nd code I need something to expect missing values and to solve the
# list index error
list = []
for row in j["tree"]["children"][0]["children"]:
    list.append(row)

def date(key):
    return dparser.parse((' '.join(key.split(' ')[-3:])), fuzzy=True)

def order(list_to_order):
    try:
        return sorted(list_to_order,
                      key=lambda k: k[date(["metadata"][0]["value"])])
    except ValueError:
        return 0

print(order(list))
I think this will do what you want, as far as selecting the right strings. I also changed the way type-checking was done to use isinstance(), which is considered a better way to do it because it supports object-oriented polymorphism.
import json

_NUL = object()  # unique value guaranteed to never be in JSON data

def get_all(myjson, kind, key):
    """ Recursively find all the values of key in all the dictionaries in myjson
        with a "type" key equal to kind.
    """
    if isinstance(myjson, dict):
        key_value = myjson.get(key, _NUL)  # _NUL if key not present
        if key_value is not _NUL and myjson.get("type") == kind:
            yield key_value
        for jsonkey in myjson:
            jsonvalue = myjson[jsonkey]
            for v in get_all(jsonvalue, kind, key):  # recursive
                yield v
    elif isinstance(myjson, list):
        for item in myjson:
            for v in get_all(item, kind, key):  # recursive
                yield v

with open('json_sample.txt', 'r') as f:
    data = json.load(f)

numentries = 0
for text in get_all(data, "sentence", "text"):
    print(text)
    numentries += 1

print('\nNumber of "text" entries found: {}'.format(numentries))
I am working on getting all the text that exists in several .yaml files into a single new YAML file that will contain the English text, which someone can then translate into Spanish.
Each YAML file has a lot of nested text. I want to print the full 'path', aka all the keys, along with the value, for each value in the YAML file. Here's an example input for a .yaml file that lives in the myproject.section.more_information file:
default:
  heading: Here’s A Title
  learn_more:
    title: Title of Thing
    url: www.url.com
    description: description
    opens_new_window: true
and here's the desired output:
myproject.section.more_information.default.heading: Here’s A Title
myproject.section.more_information.default.learn_more.title: Title of Thing
myproject.section.more_information.default.learn_more.url: www.url.com
myproject.section.more_information.default.learn_more.description: description
myproject.section.more_information.default.learn_more.opens_new_window: true
This seems like a good candidate for recursion, so I've looked at examples such as this answer
However, I want to preserve all of the keys that lead to a given value, not just the last key in a value. I'm currently using PyYAML to read/write YAML.
Any tips on how to save each key as I continue to check if the item is a dictionary and then return all the keys associated with each value?
What you're wanting to do is flatten nested dictionaries. This would be a good place to start: Flatten nested Python dictionaries, compressing keys
In fact, I think the code snippet in the top answer would work for you if you just changed the sep argument to ..
edit:
Check this for a working example based on the linked SO answer http://ideone.com/Sx625B
import collections

some_dict = {
    'default': {
        'heading': 'Here’s A Title',
        'learn_more': {
            'title': 'Title of Thing',
            'url': 'www.url.com',
            'description': 'description',
            'opens_new_window': 'true'
        }
    }
}

def flatten(d, parent_key='', sep='_'):
    items = []
    for k, v in d.items():
        new_key = parent_key + sep + k if parent_key else k
        if isinstance(v, collections.MutableMapping):
            items.extend(flatten(v, new_key, sep=sep).items())
        else:
            items.append((new_key, v))
    return dict(items)

results = flatten(some_dict, parent_key='', sep='.')

for item in results:
    print(item + ': ' + results[item])
If you want it in order, you'll need an OrderedDict though.
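A minimal ordered variant is sketched below; note this only helps if the input mapping itself preserves key order (e.g. the YAML was loaded into an OrderedDict in the first place):

import collections
from collections import OrderedDict

def flatten_ordered(d, parent_key='', sep='.'):
    # same logic as flatten() above, but the result keeps insertion order
    # (on newer Pythons use collections.abc.MutableMapping)
    items = []
    for k, v in d.items():
        new_key = parent_key + sep + k if parent_key else k
        if isinstance(v, collections.MutableMapping):
            items.extend(flatten_ordered(v, new_key, sep=sep).items())
        else:
            items.append((new_key, v))
    return OrderedDict(items)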
Walking over nested dictionaries begs for recursion, and by handing the "prefix" down into each call as you build the "path" you avoid having to do any manipulation on the segments of your path (as @Prune suggests doing).
There are a few things to keep in mind that make this problem interesting:
because you are using multiple files, the same path can occur in more than one file, which you need to handle (at least by throwing an error, as otherwise you might just lose data). In my example I gather the values into a list.
dealing with special keys (non-string (convert?), empty string, keys containing a .). My example reports these and exits.
Example code using ruamel.yaml ¹:
import sys
import glob
import ruamel.yaml
from ruamel.yaml.comments import CommentedMap, CommentedSeq
from ruamel.yaml.compat import string_types, ordereddict

class Flatten:
    def __init__(self, base):
        self._result = ordereddict()  # key to list of tuples of (value, comment)
        self._base = base

    def add(self, file_name):
        data = ruamel.yaml.round_trip_load(open(file_name))
        self.walk_tree(data, self._base)

    def walk_tree(self, data, prefix=None):
        """
        this is based on ruamel.yaml.scalarstring.walk_tree
        """
        if prefix is None:
            prefix = ""
        if isinstance(data, dict):
            for key in data:
                full_key = self.full_key(key, prefix)
                value = data[key]
                if isinstance(value, (dict, list)):
                    self.walk_tree(value, full_key)
                    continue
                # value is a scalar
                comment_token = data.ca.items.get(key)
                comment = comment_token[2].value if comment_token else None
                self._result.setdefault(full_key, []).append((value, comment))
        elif isinstance(data, list):
            print("don't know how to handle lists", prefix)
            sys.exit(1)

    def full_key(self, key, prefix):
        """
        check here for valid keys
        """
        if not isinstance(key, string_types):
            print('key has to be string', repr(key), prefix)
            sys.exit(1)
        if '.' in key:
            print('dot in key not allowed', repr(key), prefix)
            sys.exit(1)
        if key == '':
            print('empty key not allowed', repr(key), prefix)
            sys.exit(1)
        return prefix + '.' + key

    def dump(self, out):
        res = CommentedMap()
        for path in self._result:
            values = self._result[path]
            if len(values) == 1:  # single value for path
                res[path] = values[0][0]
                if values[0][1]:
                    res.yaml_add_eol_comment(values[0][1], key=path)
                continue
            res[path] = seq = CommentedSeq()
            for index, value in enumerate(values):
                seq.append(value[0])
                if value[1]:
                    seq.yaml_add_eol_comment(value[1], key=index)
        ruamel.yaml.round_trip_dump(res, out)

flatten = Flatten('myproject.section.more_information')
for file_name in glob.glob('*.yaml'):
    flatten.add(file_name)
flatten.dump(sys.stdout)
If you have an additional input file:
default:
  learn_more:
    commented: value    # this value has a comment
    description: another description
then the result is:
myproject.section.more_information.default.heading: Here’s A Title
myproject.section.more_information.default.learn_more.title: Title of Thing
myproject.section.more_information.default.learn_more.url: www.url.com
myproject.section.more_information.default.learn_more.description:
- description
- another description
myproject.section.more_information.default.learn_more.opens_new_window: true
myproject.section.more_information.default.learn_more.commented: value # this value has a comment
Of course if your input doesn't have double paths, your output won't have any lists.
Using string_types and ordereddict from ruamel.yaml makes this Python 2 and Python 3 compatible (you don't indicate which version you are using).
The ordereddict preserves the original key ordering, but this is of course dependent on the processing order of the files. If you want the paths sorted, just change dump() to use:
for path in sorted(self._result):
Also note that the comment on the 'commented' dictionary entry is preserved.
¹ ruamel.yaml is a YAML 1.2 parser that preserves comments and other data on round-tripping (PyYAML does most parts of YAML 1.1). Disclaimer: I am the author of ruamel.yaml
Keep a simple list of strings: the most recent key at each indentation depth. When you progress from one line to the next with no change in depth, simply change the item at the end of the list. When you "out-dent", pop the last item off the list. When you indent, append to the list.
Then, each time you hit a colon, the corresponding key item is the concatenation of the strings in the list, something like:
'.'.join(key_list)
Does that get you moving at an honorable speed?
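A rough sketch of that idea, assuming two-space indentation and plain key: value lines (no lists, block scalars, or multi-line values; the file name is hypothetical):

def flatten_yaml_lines(lines, indent=2):
    key_list = []                       # most recent key at each depth
    for line in lines:
        if not line.strip():
            continue
        depth = (len(line) - len(line.lstrip())) // indent
        key, _, value = line.strip().partition(':')
        del key_list[depth:]            # "out-dent": drop deeper keys
        key_list.append(key)
        if value.strip():               # a leaf: emit the full dotted path
            print('.'.join(key_list) + ':' + value.rstrip())

with open('example.yaml') as f:
    flatten_yaml_lines(f.readlines())

Prefixing the module path (e.g. myproject.section.more_information) onto each emitted line is then just a string concatenation.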