How to sort a LARGE dictionary - python

I have a python script that is working with a large (~14gb) textfile. I end up with a dictionary of keys and values, but I am getting a memory error when I try to sort the dictionary by value.
I know the dictionary is too big to load into memory and then sort, but how could I go about accomplishing this?

You can use an ordered key/value store such as wiredtiger, leveldb, or bsddb. All of them keep keys ordered and support a custom sort function. leveldb is the easiest to use, but if you use Python 2.7, bsddb is included in the stdlib. If you only need lexicographic sorting, you can use the raw btopen function to open a persistent sorted dictionary (btopen stores keys in a B-tree, which keeps them ordered; hashopen uses a hash table and does not):
from bsddb import btopen
db = btopen('dict.db')
db['020'] = 'twenty'
db['002'] = 'two'
db['value'] = 'value'
db['key'] = 'key'
print(db.keys())
This outputs:
['002', '020', 'key', 'value']
Don't forget to close the db when you are done:
db.close()
Keep in mind that the default btopen configuration might not suit your needs. In that case I recommend leveldb, which has a simple API, or wiredtiger for speed.
To order by value in bsddb, you have to use the composite key pattern (also called key composition). It boils down to creating a database key that carries the ordering you are looking for. In this example we pack the original dict value first (so that small values appear first), followed by the original dict key (so that the bsddb key is unique):
import struct
from bsddb import btopen  # B-tree format keeps keys sorted
my_dict = {'a': 500, 'abc': 100, 'foobar': 1}
# insert
db = btopen('dict.db')
for key, value in my_dict.iteritems():
    # big-endian packing makes byte order match numeric order
    composite_key = struct.pack('>Q', value) + key
    db[composite_key] = ''  # the value is unused here, but one is required
db.close()
# read
db = btopen('dict.db')
size = struct.calcsize('>Q')
for key, _ in db.iteritems():  # iterate over the database in key order
    # unpack
    value, key = key[:size], key[size:]
    value = struct.unpack('>Q', value)[0]
    print key, value
db.close()
This outputs the following:
foobar 1
abc 100
a 500
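The ordering property behind the composite key can be checked without any database. The following is a minimal Python 3 sketch, assuming non-negative integer values that fit in 64 bits; the helper names make_composite_key and split_composite_key are made up for illustration. It only shows that byte-wise sorting of the packed keys matches numeric sorting of the values, which is the property an ordered store relies on.

```python
import struct

SIZE = struct.calcsize('>Q')  # 8 bytes for an unsigned 64-bit integer


def make_composite_key(value, key):
    # Hypothetical helper: big-endian value first, then the original key.
    return struct.pack('>Q', value) + key.encode('utf-8')


def split_composite_key(composite):
    # Hypothetical helper: recover (key, value) from a composite key.
    value = struct.unpack('>Q', composite[:SIZE])[0]
    return composite[SIZE:].decode('utf-8'), value


my_dict = {'a': 500, 'abc': 100, 'foobar': 1}

# Sorting the raw bytes stands in for the ordered iteration of the store.
composite_keys = sorted(make_composite_key(v, k) for k, v in my_dict.items())
ordered = [split_composite_key(c) for c in composite_keys]
print(ordered)
```

A little-endian format such as '<Q' would not preserve numeric order under byte-wise comparison, which is why the example packs with '>Q'.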

Related

How to not remove duplicates automatically when using method json.loads? [duplicate]

I need to parse a JSON file which, unfortunately for me, does not follow the standard format. I have two issues with the data. I've already found a workaround for one of them, so I'll just mention it at the end; maybe someone can help there as well.
So I need to parse entries like this:
"Test":{
    "entry":{
        "Type":"Something"
    },
    "entry":{
        "Type":"Something_Else"
    }
}, ...
The default JSON parser updates the dictionary and therefore keeps only the last entry. I HAVE to somehow store the other one as well, and I have no idea how to do this. I also HAVE to store the keys in the several dictionaries in the same order they appear in the file, which is why I am using an OrderedDict. It works fine, so if there is any way to extend this to the duplicate entries I'd be grateful.
My second issue is that this very same JSON file contains entries like this:
"Test":{
    {
        "Type":"Something"
    }
}
The json.load() function raises an exception when it reaches that line in the JSON file. The only workaround I found was to manually remove the inner braces myself.
Thanks in advance
You can use JSONDecoder.object_pairs_hook to customize how JSONDecoder decodes objects. This hook function will be passed a list of (key, value) pairs that you usually do some processing on, and then turn into a dict.
However, since Python dictionaries don't allow for duplicate keys (and you simply can't change that), you can return the pairs unchanged in the hook and get a nested list of (key, value) pairs when you decode your JSON:
from json import JSONDecoder
def parse_object_pairs(pairs):
    return pairs
data = """
{"foo": {"baz": 42}, "foo": 7}
"""
decoder = JSONDecoder(object_pairs_hook=parse_object_pairs)
obj = decoder.decode(data)
print obj
Output:
[(u'foo', [(u'baz', 42)]), (u'foo', 7)]
How you use this data structure is up to you. As stated above, Python dictionaries won't allow for duplicate keys, and there's no way around that. How would you even do a lookup based on a key? dct[key] would be ambiguous.
So you can either implement your own logic to handle a lookup the way you expect it to work, or implement some sort of collision avoidance to make keys unique if they're not, and then create a dictionary from your nested list.
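As a rough illustration of the first option, here is a minimal Python 3 sketch that keeps the decoded pairs as a list and looks up every value stored under a key. The names keep_pairs and get_all are hypothetical helpers, not part of the json module.

```python
import json


def keep_pairs(pairs):
    # Return the (key, value) pairs unchanged so duplicates survive decoding.
    return pairs


def get_all(pairs, key):
    # Hypothetical helper: every value stored under `key`, in document order.
    return [value for k, value in pairs if k == key]


data = '{"foo": {"baz": 42}, "foo": 7}'
obj = json.loads(data, object_pairs_hook=keep_pairs)
foo_values = get_all(obj, "foo")
print(foo_values)
```

Note that nested objects are also decoded as lists of pairs, so the same helper can be applied at each level.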
Edit: Since you said you would like to modify the duplicate key to make it unique, here's how you'd do that:
from collections import OrderedDict
from json import JSONDecoder
def make_unique(key, dct):
    counter = 0
    unique_key = key
    while unique_key in dct:
        counter += 1
        unique_key = '{}_{}'.format(key, counter)
    return unique_key
def parse_object_pairs(pairs):
    dct = OrderedDict()
    for key, value in pairs:
        if key in dct:
            key = make_unique(key, dct)
        dct[key] = value
    return dct
data = """
{"foo": {"baz": 42, "baz": 77}, "foo": 7, "foo": 23}
"""
decoder = JSONDecoder(object_pairs_hook=parse_object_pairs)
obj = decoder.decode(data)
print obj
Output:
OrderedDict([(u'foo', OrderedDict([(u'baz', 42), ('baz_1', 77)])), ('foo_1', 7), ('foo_2', 23)])
The make_unique function is responsible for returning a collision-free key. In this example it just suffixes the key with _n where n is an incremental counter - just adapt it to your needs.
Because the object_pairs_hook receives the pairs exactly in the order they appear in the JSON document, it's also possible to preserve that order by using an OrderedDict, so I included that as well.
Thanks a lot @Lukas Graf, I got it working as well by implementing my own version of the hook function:
import collections
def dict_raise_on_duplicates(ordered_pairs):
    count = 0
    d = collections.OrderedDict()
    for k, v in ordered_pairs:
        if k in d:
            # duplicate key: store it under a suffixed name
            d[k + '_dupl_' + str(count)] = v
            count += 1
        else:
            d[k] = v
    return d
The only thing remaining is to automatically get rid of the double brackets and I am done :D Thanks again
If you would prefer to collect the values of duplicated keys into an array instead of keeping separate copies, this could do the work:
import json
def dict_raise_on_duplicates(ordered_pairs):
    """Convert duplicate keys to JSON array."""
    d = {}
    for k, v in ordered_pairs:
        if k in d:
            if isinstance(d[k], list):
                d[k].append(v)
            else:
                d[k] = [d[k], v]
        else:
            d[k] = v
    return d
And then you just use:
data = json.loads(yourString, object_pairs_hook=dict_raise_on_duplicates)

How to associate elements in a set with multiple dict entries

I'm extracting instances of three elements from an XML file: ComponentStr, keyID, and valueStr. Whenever I find a ComponentStr, I want to associate the keyID:valueStr with it. ComponentStr values are not unique. As multiple occurrences of a ComponentStr are read, I want to accumulate the keyID:valueStr pairs for that ComponentStr group. The resulting accumulated data structure after reading the XML file might look like this:
ComponentA: key1:value1, key2:value2, key3:value3
ComponentB: key4:value4
ComponentC: key5:value5, key6:value6
After I generate the final data structure, I want to sort the keyID:valueStr entries within each ComponentStr and also sort all the ComponentStrs.
I'm trying to structure this data in Python 2. The ComponentStr values seem to work well as a set. The keyID:valueStr pairs are clearly a dict. But how do I associate a ComponentStr entry in a set with its dict entries?
Alternatively, is there a better way to organize this data besides a set and associated dict entries? Each keyID is unique. Perhaps I could have one dict of keyID:some combo of ComponentStr and valueStr? After the data structure was built, I could sort it based on ComponentStr first, then perform some type of slice to group the keyID:valueStr and then sort again on the keyID? Seems complicated.
How about a dict of dicts?
data = {
    'ComponentA': {'key1':'value1', 'key2':'value2', 'key3':'value3'},
    'ComponentB': {'key4':'value4'},
    'ComponentC': {'key5':'value5', 'key6':'value6'},
}
It maintains your data structure and mapping. Interestingly enough, the underlying implementation of dicts is similar to the implementation of sets.
This could easily be constructed along the lines of this pseudo-code:
data = {}
for file in files:
    data[get_component(file)] = {}
    for key, value in get_data(file):
        data[get_component(file)][key] = value
In the case where you have repeated components, you need the sub-dict as the default, but you add to the previous one if it's already there. I prefer setdefault to other solutions like a defaultdict or subclassing dict with a __missing__ method, as long as I only have to do it once or twice in my code:
data = {}
for file in files:
    for key, value in get_data(file):
        # the component itself is the key (a list would be unhashable)
        data.setdefault(get_component(file), {})[key] = value
It works like this:
>>> d = {}
>>> d.setdefault('foo', {})['bar'] = 'baz'
>>> d
{'foo': {'bar': 'baz'}}
>>> d.setdefault('foo', {})['ni'] = 'ichi'
>>> d
{'foo': {'ni': 'ichi', 'bar': 'baz'}}
Alternatively, since your comment on the other answer says you need simple code, you can keep it really simple with some more verbose and less optimized code:
data = {}
for file in files:
    for key, value in get_data(file):
        if get_component(file) not in data:
            data[get_component(file)] = {}
        data[get_component(file)][key] = value
You can then sort when you're done collecting the data.
for component in sorted(data):
    print(component)
    print('-----')
    for key in sorted(data[component]):
        print(key, data[component][key])
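For comparison, the defaultdict alternative mentioned above could look like the following Python 3 sketch. The records list is invented sample data standing in for whatever the XML parsing produces as (component, key, value) triples.

```python
from collections import defaultdict

# Hypothetical sample records as (component, key, value) triples,
# deliberately out of order and with repeated components.
records = [
    ("ComponentC", "key6", "value6"),
    ("ComponentA", "key2", "value2"),
    ("ComponentB", "key4", "value4"),
    ("ComponentA", "key1", "value1"),
    ("ComponentC", "key5", "value5"),
    ("ComponentA", "key3", "value3"),
]

# A missing component automatically gets an empty sub-dict.
data = defaultdict(dict)
for component, key, value in records:
    data[component][key] = value

# Sort the components, and the key/value pairs within each component.
report = [
    (component, sorted(data[component].items()))
    for component in sorted(data)
]
for component, pairs in report:
    print(component, pairs)
```

The trade-off against setdefault is mostly stylistic: defaultdict moves the "create if missing" logic into the container, at the cost of also creating entries on plain reads of a missing key.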
I want to accumulate the keyID:valueStr for that ComponentStr group
In this case you want the ComponentStr values as the keys of your dictionary. To me, accumulating immediately suggests a list, which is easy to sort.
Each keyID is unique. Perhaps I could have one dict of keyID:some
combo of ComponentStr and valueStr?
You should store your data in a manner that is the most efficient when you want to retrieve it. Since you will be accessing your data by the component, even though your keys are unique there is no point in having a dictionary that is accessed by your key (since this is not how you are going to "retrieve" the data).
So, with that - how about using a defaultdict with a list, since you really want all items associated with the same component:
from collections import defaultdict
d = defaultdict(list)
with open('somefile.xml', 'r') as f:
    for component, key, value in parse_xml(f):
        d[component].append((key, value))
Now you have for each component, a list of tuples which are the associated key and values.
If you want to keep the components in the order in which they are read from the file, you can use an OrderedDict (also from the collections module), but if you are going to sort them yourself afterwards, stick with a normal dictionary.
To get a list of sorted component names, just sort the keys of the dictionary:
component_sorted = sorted(d.keys())
For a use case of printing the sorted components with their associated key/value pairs, sorted by their keys:
for key in component_sorted:
    values = d[key]
    sorted_values = sorted(values, key=lambda x: x[0])  # sort by the keys
    print('Pairs for {}'.format(key))
    for k, v in sorted_values:
        print('{} {}'.format(k, v))

Python dictionary key error when assigning - how do I get around this?

I have a dictionary that I create like this:
myDict = {}
Then I'd like to add a key to it that corresponds to another dictionary, in which I put another value:
myDict[2000]['hello'] = 50
So when I pass myDict[2000]['hello'] somewhere, it would give 50.
Why isn't Python just creating those entries right there? What's the issue? I thought KeyError only occurs when you try to read an entry that doesn't exist, but I'm creating it right here?
The KeyError occurs because you are trying to read a nonexistent key when you access myDict[2000]. As an alternative, you could use defaultdict:
>>> from collections import defaultdict
>>> myDict = defaultdict(dict)
>>> myDict[2000]['hello'] = 50
>>> myDict[2000]
{'hello': 50}
defaultdict(dict) means that if myDict encounters an unknown key, it will return a default value, in this case whatever is returned by dict() which is an empty dictionary.
What you want is to implement a nested dict. I recommend this approach:
class Vividict(dict):
    def __missing__(self, key):
        value = self[key] = type(self)()
        return value
From the docs, under d[key]
New in version 2.5: If a subclass of dict defines a method
__missing__(), if the key key is not present, the d[key]
operation calls that method with the key key as argument
To try it:
myDict = Vividict()
myDict[2000]['hello'] = 50
and myDict now returns:
{2000: {'hello': 50}}
And this will work for any arbitrary depth you want:
myDict['foo']['bar']['baz']['quux']
just works.
But you are trying to read an entry that doesn't exist: myDict[2000].
The exact translation of what you say in your code is "give me the entry in myDict with the key of 2000, and store 50 against the key 'hello' in that entry." But myDict doesn't have a key of 2000, hence the error.
What you actually need to do is to create that key. You can do that in one go:
myDict[2000] = {'hello': 50}
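The two-step evaluation described above can be made visible with a small Python 3 sketch. The variable names are illustrative only; the second half shows setdefault as one way to create the inner dict on demand.

```python
my_dict = {}

# Step 1 of the nested assignment is a read of my_dict[2000],
# which fails before any assignment to 'hello' is attempted.
try:
    my_dict[2000]['hello'] = 50
except KeyError as exc:
    missing_key = exc.args[0]

# setdefault creates the inner dict if it is missing, then returns it,
# so the assignment to 'hello' has something to write into.
my_dict.setdefault(2000, {})['hello'] = 50
print(missing_key, my_dict)
```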
In the scenario below, appending new_result to a list stored in the dict raises KeyError: 'result':
results = {}
new_result = {'key1':'new_value1','key2':'new_value'}
results['result'].append(new_result)
This happens because the dict doesn't have a 'result' key yet. I fixed this problem with defaultdict (the setdefault method of a plain dict works as well).
To try it:
from collections import defaultdict
results = defaultdict(list)
new_result = {'key1':'new_value1','key2':'new_value2'}
results['result'].append(new_result)
# plain-dict equivalent: results.setdefault('result', []).append(new_result)
You're right, but in your code python has to first get myDict[2000] and then do the assignment. Since that entry doesn't exist it can't assign to its elements

Adding Multiple Values to a Single Key in Python Dictionary

Python dictionaries really have me stumped today. I've been poring over Stack Overflow, trying to find a way to do a simple append of a new value to an existing key in a Python dictionary, and I'm failing at every attempt even though I'm using the same syntax I see on here.
This is what I am trying to do:
# cursor search an xls file
definitionQuery_Dict = {}
for row in arcpy.SearchCursor(xls):
    # set some source paths from strings in the xls file
    dataSourcePath = str(row.getValue("workspace_path")) + "\\" + str(row.getValue("dataSource"))
    dataSource = row.getValue("dataSource")
    # add items to dictionary. The keys are the datasource table and the values will be definition (SQL) queries. First test is to see if a definition query exists in the row and if it does, we want to add the key, value pair to a dictionary.
    if row.getValue("Definition_Query") <> None:
        # if key already exists, then append a new value to the value list
        if row.getValue("dataSource") in definitionQuery_Dict:
            definitionQuery_Dict[row.getValue("dataSource")].append(row.getValue("Definition_Query"))
        else:
            # otherwise, add a new key, value pair
            definitionQuery_Dict[row.getValue("dataSource")] = row.getValue("Definition_Query")
I get an attribute error:
AttributeError: 'unicode' object has no attribute 'append'
But I believe I am doing the same as the answer provided here
I've tried various other methods with no luck, getting various other error messages. I know this is probably simple and maybe I just couldn't find the right source on the web, but I'm stuck. Anyone care to help?
Thanks,
Mike
The issue is that you're originally setting the value to a string (i.e. the result of row.getValue) but then trying to append to it if the key already exists. You need to set the original value to a list containing a single string. Change the last line to this:
definitionQuery_Dict[row.getValue("dataSource")] = [row.getValue("Definition_Query")]
(notice the brackets round the value).
ndpu has a good point with the use of defaultdict: but if you're using that, you should always do append - ie replace the whole if/else statement with the append you're currently doing in the if clause.
Your dictionary has keys and values. If you want to add to the values as you go, then each value has to be a type that can be extended/expanded, like a list or another dictionary. Currently each value in your dictionary is a string, where what you want instead is a list containing strings. If you use lists, you can do something like:
mydict = {}
records = [('a', 2), ('b', 3), ('a', 4)]
for key, data in records:
    # If this is a new key, create a list to store
    # the values
    if key not in mydict:
        mydict[key] = []
    mydict[key].append(data)
Output:
mydict
Out[4]: {'a': [2, 4], 'b': [3]}
Note that even though 'b' only has one value, that single value still has to be put in a list, so that it can be added to later on.
Use collections.defaultdict:
from collections import defaultdict
definitionQuery_Dict = defaultdict(list)
# ...
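Spelled out a little further, the defaultdict approach might look like this Python 3 sketch. The rows list is a made-up stand-in for the cursor rows, reduced to (dataSource, Definition_Query) pairs, since arcpy is not assumed to be available here.

```python
from collections import defaultdict

# Hypothetical stand-in for the rows read from the xls file:
# (dataSource, Definition_Query) pairs, where the query may be missing.
rows = [
    ("roads", "TYPE = 'highway'"),
    ("parcels", None),
    ("roads", "LANES > 2"),
]

definition_queries = defaultdict(list)
for data_source, query in rows:
    if query is not None:
        # No if/else needed: a missing key starts out as an empty list.
        definition_queries[data_source].append(query)

print(dict(definition_queries))
```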

How can I get dictionary key as variable directly in Python (not by searching from value)?

Sorry for this basic question but my searches on this are not turning up anything other than how to get a dictionary's key based on its value which I would prefer not to use as I simply want the text/name of the key and am worried that searching by value may end up returning 2 or more keys if the dictionary has a lot of entries... what I am trying to do is this:
mydictionary={'keyname':'somevalue'}
for current in mydictionary:
    result = mydictionary.(some_function_to_get_key_name)[current]
    print result
"keyname"
The reason for this is that I am printing these out to a document and I want to use the key name and the value in doing this
I have seen the method below but this seems to just return the key's value
get(key[, default])
You should iterate over keys with:
for key in mydictionary:
    print "key: %s , value: %s" % (key, mydictionary[key])
If you want to access both the key and value, use the following:
Python 2:
for key, value in my_dict.iteritems():
    print(key, value)
Python 3:
for key, value in my_dict.items():
    print(key, value)
The reason for this is that I am printing these out to a document and I want to use the key name and the value in doing this
Based on the above requirement this is what I would suggest:
keys = mydictionary.keys()
keys.sort()
for each in keys:
    print "%s: %s" % (each, mydictionary.get(each))
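A Python 3 counterpart of the snippet above, written as a minimal sketch: in Python 3, keys() returns a view without a sort() method, so sorted() is used instead. The io.StringIO buffer is only an assumption standing in for the document being written, and the sample dictionary is invented.

```python
import io

mydictionary = {'b': 2, 'a': 1}

# Stand-in for the output document; a real file object works the same way.
document = io.StringIO()
for key, value in sorted(mydictionary.items()):
    document.write('{}: {}\n'.format(key, value))

text = document.getvalue()
print(text)
```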
If the dictionary contains one pair like this:
d = {'age':24}
then you can get as
field, value = d.items()[0]
For Python 3.5, do this:
key = list(d.keys())[0]
keys=[i for i in mydictionary.keys()] or
keys = list(mydictionary.keys())
As simple as that:
mydictionary={'keyname':'somevalue'}
result = mydictionary.popitem()[0]
Note that popitem() removes the item from the dictionary, so make a copy first if you still need the original.
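If the dictionary must stay untouched, one option is to take the first key from an iterator instead of popping it. This is a small Python 3 sketch under that assumption.

```python
mydictionary = {'keyname': 'somevalue'}

# iter() over a dict yields its keys; next() takes the first one
# without removing anything from the dictionary.
first_key = next(iter(mydictionary))
print(first_key)
print(mydictionary)
```

On an empty dictionary next() raises StopIteration, so a default can be passed as a second argument, for example next(iter(mydictionary), None).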
You could simply use * which unpacks the dictionary keys. Example:
d = {'x': 1, 'y': 2}
t = (*d,)
print(t) # ('x', 'y')
Iterating over a dictionary yields its keys (i here), which you can then use to get each value:
for i in D:
    print "key: %s, value: %s" % (i, D[i])
For Python 3:
If you only want the keys, use this. Replace print(key) with print(value) if you want the values.
for key, value in my_dict.items():
    print(key)
What I sometimes do is create another dictionary just to hold whatever I feel I need to access as a string. Then I iterate over multiple dictionaries with matching keys to build e.g. a table with a description as the first column.
dict_names = {'key1': 'Text 1', 'key2': 'Text 2'}
dict_values = {'key1': 0, 'key2': 1}
for key, value in dict_names.items():
    print('{0} {1}'.format(dict_names[key], dict_values[key]))
You can easily do for a huge amount of dictionaries to match data (I like the fact that with dictionary you can always refer to something well known as the key name)
yes I use dictionaries to store results of functions so I don't need to run these functions everytime I call them just only once and then access the results anytime.
EDIT: in my example the key name does not really matter (I personally like using the same key names as it is easier to go pick a single value from any of my matching dictionaries), just make sure the number of keys in each dictionary is the same
You can do this by casting the dict keys and values to a list. It can also be done for items.
Example:
f = {'one': 'police', 'two': 'oranges', 'three': 'car'}
list(f.keys())[0]  # 'one'
list(f.keys())[1]  # 'two'
list(f.values())[0]  # 'police'
list(f.values())[1]  # 'oranges'
if you just need to get a key-value from a simple dictionary like e.g:
os_type = {'ubuntu': '20.04'}
use popitem() method:
os, version = os_type.popitem()
print(os) # 'ubuntu'
print(version) # '20.04'
names=[key for key, value in mydictionary.items()]
if you have a dict like
d = {'age':24}
then you can get key and value by d.popitem()
key, value = d.popitem()
You can easily swap the positions of your keys and values, then use a value to get its key.
In a dictionary, different keys can have the same value, but the keys themselves must be unique.
For instance, if you have a list whose first element is the key for your problem and whose other elements are the specs of that first element:
list1=["Name",'ID=13736','Phone:1313','Dep:pyhton']
you can save and use the data easily in a dictionary with this loop:
data_dict={}
for i in range(1, len(list1)):
    data_dict[list1[i]]=list1[0]
print(data_dict)
{'ID=13736': 'Name', 'Phone:1313': 'Name', 'Dep:pyhton': 'Name'}
Then you can find the key (name) based on any input value.
