Splitting Dictionary on Bytes - python

I have some python code that:
Pulls various metrics from different endpoints
Joins them in a common dictionary with some standardized key/values
Uploads the dictionary to another tool for analysis
While this generally works, the dictionary sometimes gets too large, which causes performance issues at various points.
I've seen examples using itertools to split based on ranges of keys, or to split evenly based on the number of keys. However, I would like to split it based on the size in bytes, as some of the metrics are drastically larger than others.
Can a dictionary be dynamically split into a list of dictionaries based on the size in bytes?

Assuming that both keys and values are sane types that you can call sys.getsizeof on in a meaningful way, and all distinct objects, you can use that information to split your dictionary into equal-ish chunks.
First compute the total size if you want the max chunk to be a divisor of that. If your maximum size is fixed externally, you can skip this step:
total_size = sum(getsizeof(k) + getsizeof(v) for k, v in my_dict.items())
Now you can iterate the dictionary, assuming approximately random distribution of sizes throughout, cutting a new dict before you exceed the max_size threshold:
from sys import getsizeof
def split_dict(d, max_size):
    result = []
    current_size = 0
    current_dict = {}
    while d:
        k, v = d.popitem()
        increment = getsizeof(k) + getsizeof(v)
        if increment + current_size > max_size:
            result.append(current_dict)
            if current_size:
                current_dict = {k: v}
                current_size = increment
            else:
                current_dict[k] = v # going to list
                current_dict = {}
                current_size = 0
        else:
            current_dict[k] = v
            current_size += increment
    if current_dict:
        result.append(current_dict)
    return result
Keep in mind that dict.popitem is destructive: you are actually removing everything from my_dict to populate the smaller versions.
Here is a highly simplified example:
>>> from string import ascii_letters
>>> d = {s: i for i, s in enumerate(ascii_letters)}
>>> total_size = sum(getsizeof(k) + getsizeof(v) for k, v in d.items())
>>> split_dict(d, total_size // 5)
[{'Z': 51, 'Y': 50, 'X': 49, 'W': 48, 'V': 47, 'U': 46, 'T': 45, 'S': 44, 'R': 43, 'Q': 42},
{'P': 41, 'O': 40, 'N': 39, 'M': 38, 'L': 37, 'K': 36, 'J': 35, 'I': 34, 'H': 33, 'G': 32},
{'F': 31, 'E': 30, 'D': 29, 'C': 28, 'B': 27, 'A': 26, 'z': 25, 'y': 24, 'x': 23, 'w': 22},
{'v': 21, 'u': 20, 't': 19, 's': 18, 'r': 17, 'q': 16, 'p': 15, 'o': 14, 'n': 13, 'm': 12},
{'l': 11, 'k': 10, 'j': 9, 'i': 8, 'h': 7, 'g': 6, 'f': 5, 'e': 4, 'd': 3, 'c': 2},
{'b': 1, 'a': 0}]
As you can see, the split is not necessarily optimal in terms of distribution, but it ensures that no chunk is bigger than max_size, unless one single entry requires more bytes than that.
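If the original dictionary has to survive the split, the same idea can be sketched without popitem. This is only an illustration: the name split_dict_keep is made up here, and it iterates in insertion order instead of popping from the end. It relies on the same assumption as above, namely that getsizeof is meaningful for the keys and values.

```python
from sys import getsizeof

def split_dict_keep(d, max_size):
    """Yield chunks of d without modifying d (hypothetical helper name)."""
    chunk, chunk_size = {}, 0
    for k, v in d.items():
        increment = getsizeof(k) + getsizeof(v)
        # Close the current chunk before it would exceed the limit.
        # An empty chunk always accepts the item, so an oversized
        # entry ends up alone in its own chunk.
        if chunk and chunk_size + increment > max_size:
            yield chunk
            chunk, chunk_size = {}, 0
        chunk[k] = v
        chunk_size += increment
    if chunk:
        yield chunk
```

Because it is a generator, each chunk can be uploaded as soon as it is produced; wrap the call in list() if you need all the chunks at once.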
Update For Not-Sane Values
If you have arbitrarily large nested values, you can still split at the top level, however, you will have to replace getsizeof(v) with something more robust. For example:
from collections.abc import Mapping, Iterable
def xgetsizeof(x):
    if isinstance(x, Mapping):
        return getsizeof(x) + sum(xgetsizeof(k) + xgetsizeof(v) for k, v in x.items())
    # strings and bytes are iterable, but getsizeof already covers their contents
    if isinstance(x, Iterable) and not isinstance(x, (str, bytes)):
        return getsizeof(x) + sum(xgetsizeof(e) for e in x)
    return getsizeof(x)
Now you can also compute total_size with a single call:
total_size = xgetsizeof(d)
Notice that this is bigger than the value you saw before. The earlier result was
xgetsizeof(d) - getsizeof(d)
To make the solution really robust, you would need to add instance tracking to avoid circular references and double-counting.
I went ahead and wrote such a function for my library haggis, called haggis.objects.getsizeof. It behaves largely like xgetsizeof above, but much more robustly.
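For illustration, instance tracking can be sketched by remembering the id() of every object already visited. This is not the haggis implementation, just a minimal sketch under the same assumptions as xgetsizeof; the name deep_getsizeof and the seen parameter are made up for this example. It will still consume one-shot iterators such as generators, so it is only suitable for plain containers.

```python
from collections.abc import Mapping, Iterable
from sys import getsizeof

def deep_getsizeof(x, seen=None):
    """Recursive size that counts each distinct object once (sketch)."""
    if seen is None:
        seen = set()
    if id(x) in seen:
        # Already counted: a shared object or a circular reference.
        return 0
    seen.add(id(x))
    size = getsizeof(x)
    if isinstance(x, Mapping):
        size += sum(deep_getsizeof(k, seen) + deep_getsizeof(v, seen)
                    for k, v in x.items())
    elif isinstance(x, Iterable) and not isinstance(x, (str, bytes, bytearray)):
        size += sum(deep_getsizeof(e, seen) for e in x)
    return size
```

A list that contains itself no longer recurses forever, and a value referenced from two keys is only counted the first time it is seen.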

Related

Writing a program that displays the number of observations from A-Z without distinguishing between lowercase and uppercase

I am writing a program that works on the text printed by 'import this' in IDLE. I want to print how often each letter occurs in the text (the program should not distinguish between lowercase and uppercase).
ex. "Hello world" --> [h= 1, e= 1, l= 3...]
This is what I found when searching for a solution
from collections import Counter
test_str = '''
The Zen of Python, by Tim Peters
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
'''
res = Counter(test_str)
print ("The characters are:\n "
+ str(res))
Unfortunately, this count distinguishes between lowercase and uppercase letters. Does anyone have a better idea?
The code from above prints this:
The characters are:
Counter({' ': 124, 'e': 90, 't': 76, 'i': 50, 'a': 50, 'o': 43, 's': 43, 'n': 40, 'l': 33, 'r': 32, 'h': 31, '\n': 22, 'b': 20, 'u': 20, 'p': 20, '.': 18, 'y': 17, 'm': 16, 'c': 16, 'd': 16, 'f': 11, 'g': 11, 'x': 6, '-': 6, 'v': 5, ',': 4, "'": 4, 'w': 4, 'T': 3, 'S': 3, 'A': 3, 'I': 3, 'P': 2, 'E': 2, 'k': 2, 'N': 2, '*': 2, 'Z': 1, 'B': 1, 'C': 1, 'F': 1, 'R': 1, 'U': 1, 'D': 1, '!': 1})
Lowercase the text before counting it: Counter(test_str.lower()).
Also, you can count without importing Counter, using len()
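A minimal sketch of the lowercase approach, using the "Hello world" example from the question as a stand-in for the full text. The variable names are only illustrative, and the isalpha() filter is an extra assumption that spaces and punctuation should be left out of the count.

```python
from collections import Counter

text = "Hello world"

# Lowercase first so 'H' and 'h' land in the same bucket,
# then keep only alphabetic characters.
letter_counts = Counter(ch for ch in text.lower() if ch.isalpha())

for letter in sorted(letter_counts):
    print(letter, "=", letter_counts[letter])
```

For "Hello world" this reports l three times, o twice, and d, e, h, r and w once each.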

Assign integers to alphabets and add those integers

I am trying to assign numbers 1-26 to alphabets a-z and add up those numbers according to any given string without any success. For example: a = 1, b=2, c=3. So, if any given string is "abc", the output should be 1+2+3=6.
Programming background - Novice, self-learning.
I have only learned up to strings, lists and their corresponding methods in Python programming. I haven't studied functions and classes yet, so please make your answers as simple as possible.
So far I've tried
name = "abc"
a, b, c = [1, 2, 3]
sum_of_name = ""
for alphabet in name:
    sum_of_name = sum_of_name + alphabet
print(sum_of_name)
Prints out the same abc.
I realise that when I iterate the string "abc", the string is different than the variables a,b and c. Thus, the integers aren't assigned to the strings and can't be added up.
Any suggestions on how I can work through this with my current level of knowledge?
This is one approach.
Demo:
from string import ascii_lowercase
d = {v: i for i,v in enumerate(ascii_lowercase, 1)}
Name = "abc"
print( sum(d[i] for i in Name) )
Output:
6
First make a list of the letters
>>> from string import ascii_lowercase as alphabet
>>> alphabet
'abcdefghijklmnopqrstuvwxyz'
Then make a lookup of letter to value (there are other ways to do this)
>>> values = {letter: value for value, letter in enumerate(alphabet, 1)}
>>> values
{'d': 4, 'f': 6, 'o': 15, 'b': 2, 's': 19, 'c': 3, 'w': 23, 'q': 17, 'v': 22, 'p': 16, 'i': 9, 'e': 5, 'l': 12, 't': 20, 'y': 25, 'n': 14, 'a': 1, 'r': 18, 'j': 10, 'x': 24, 'g': 7, 'm': 13, 'k': 11, 'h': 8, 'z': 26, 'u': 21}
Then use that to sum values
def sum_letters(word):
    return sum(values[letter] for letter in word)
>>> sum_letters('abc')
6
If you have a fixed order then you can use ord()
a = "Name"
s = 0
for i in a.lower():
    s += ord(i) - 96
print(s)
To get the characters in the alphabet, you can use the string lib:
>>> import string
>>> letters = string.ascii_lowercase
>>> letters
'abcdefghijklmnopqrstuvwxyz'
We can then turn that into a dictionary to make getting the numeric (positional) value of a letter easy:
letter_map = dict(zip(list(letters), range(1, len(letters) + 1)))
So your function will perform a simple dict lookup for each letter input:
def string_sum(string_input):
    return sum(letter_map[char] for char in string_input)
Several test cases:
>>> assert string_sum('abc') == 6
>>> assert string_sum('') == 0 # because it's empty
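Since the question asks for something that only needs strings and their methods, here is a sketch that avoids functions and dictionaries altogether. It assumes that characters outside a-z should simply be skipped; the variable names are only illustrative.

```python
alphabet = "abcdefghijklmnopqrstuvwxyz"
name = "abc"

total = 0
for letter in name.lower():
    position = alphabet.find(letter)  # -1 when the character is not in a-z
    if position != -1:
        total = total + position + 1  # 'a' sits at index 0, so add 1
print(total)  # 6
```

str.find is used instead of str.index because it returns -1 rather than raising an error for characters such as spaces or digits.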

How to remove the least frequent element from a Counter in Python the fastest way?

I'd like to implement a Counter which drops the least frequent element when the counter's size goes beyond some threshold. For that I need to remove the least frequent element.
What is the fastest way to do that in Python?
I know counter.most_common()[-1], but it creates a whole list and seems slow when done extensively? Is there a better command (or maybe a different data structure)?
You can implement least_common by borrowing the implementation of most_common and making the necessary changes.
Refer to collections source in Py2.7:
def most_common(self, n=None):
    '''List the n most common elements and their counts from the most
    common to the least. If n is None, then list all element counts.
    >>> Counter('abcdeabcdabcaba').most_common(3)
    [('a', 5), ('b', 4), ('c', 3)]
    '''
    # Emulate Bag.sortedByCount from Smalltalk
    if n is None:
        return sorted(self.iteritems(), key=_itemgetter(1), reverse=True)
    return _heapq.nlargest(n, self.iteritems(), key=_itemgetter(1))
To change it in order to retrieve least common we need just a few adjustments.
import collections
from operator import itemgetter as _itemgetter
import heapq as _heapq
class MyCounter(collections.Counter):
    def least_common(self, n=None):
        # items() instead of iteritems() so this also runs on Python 3
        if n is None:
            return sorted(self.items(), key=_itemgetter(1), reverse=False) # was: reverse=True
        return _heapq.nsmallest(n, self.items(), key=_itemgetter(1)) # was _heapq.nlargest
Tests:
c = MyCounter("abbcccddddeeeee")
assert c.most_common() == c.least_common()[::-1]
assert c.most_common()[-1:] == c.least_common(1)
Since your stated goal is to remove items in the counter below a threshold, just reverse the counter (so the values becomes a list of keys with that value) and then remove the keys in the counter below the threshold.
Example:
>>> c=Counter("aaaabccadddefeghizkdxxx")
>>> c
Counter({'a': 5, 'd': 4, 'x': 3, 'c': 2, 'e': 2, 'b': 1, 'g': 1, 'f': 1, 'i': 1, 'h': 1, 'k': 1, 'z': 1})
counts={}
for k, v in c.items():
    counts.setdefault(v, []).append(k)
tol=2
for k, v in counts.items():
    if k<=tol:
        c=c-Counter({}.fromkeys(v, k))
>>> c
Counter({'a': 5, 'd': 4, 'x': 3})
In this example, all counts less than or equal to 2 are removed.
Or, just recreate the counter with a comparison to your threshold value:
>>> c
Counter({'a': 5, 'd': 4, 'x': 3, 'c': 2, 'e': 2, 'b': 1, 'g': 1, 'f': 1, 'i': 1, 'h': 1, 'k': 1, 'z': 1})
>>> Counter({k:v for k,v in c.items() if v>tol})
Counter({'a': 5, 'd': 4, 'x': 3})
If you only want to get the least common value, then the most efficient way to handle this is to simply get the minimum value from the counter (dictionary).
Since you can only say whether a value is the lowest, you actually need to look at all items, so a time complexity of O(n) is really the lowest we can get. However, we do not need to have a linear space complexity, as we only need to remember the lowest value, and not all of them. So a solution that works like most_common() in reverse is too much for us.
In this case, we can simply use min() with a custom key function here:
>>> c = Counter('foobarbazbar')
>>> c
Counter({'a': 3, 'b': 3, 'o': 2, 'r': 2, 'f': 1, 'z': 1})
>>> k = min(c, key=lambda x: c[x])
>>> del c[k]
>>> c
Counter({'a': 3, 'b': 3, 'o': 2, 'r': 2, 'z': 1})
Of course, if several keys share the lowest count, you have little control over which one is removed: min() returns the first minimal key it encounters in the dictionary's iteration order.
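Putting the min() idea together with the size threshold from the question, a bounded counter could be sketched as follows. The helper name add_bounded and its max_size parameter are invented for this example. Each eviction is still an O(n) scan, so this is only a starting point if evictions are frequent.

```python
from collections import Counter

def add_bounded(counter, item, max_size):
    """Count item, then evict one least frequent key if the
    counter has grown past max_size (hypothetical helper)."""
    counter[item] += 1
    if len(counter) > max_size:
        # min() scans the keys once and keeps no extra list around.
        victim = min(counter, key=counter.get)
        del counter[victim]

c = Counter()
for ch in "aaabbc":
    add_bounded(c, ch, max_size=2)
print(c)
```

In this run 'c' arrives last with a count of 1 and is evicted immediately, leaving only 'a' and 'b' in the counter.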

Reading in a pprinted file in Python

I have a long-running script that collates a bunch of data for me. Without thinking too much about it, I set it up to periodically serialize all this data collected out to a file using something like this:
pprint.pprint(data, open('my_log_file.txt', 'w'))
The output of pprint is perfectly valid Python code. Is there an easy way to read in the file into memory so that if I kill the script I can start where I left off? Basically, is there a function which parses a text file as if it were a Python value and returns the result?
If I understand the problem correctly, you are writing one object to a log file? In that case you can simply use eval to turn it back into a valid Python object.
from pprint import pprint
# make some simple data structures
dct = {k: v for k, v in zip('abcdefghijklmnopqrstuvwxyz', range(26))}
# define a filename
filename = '/tmp/foo.txt'
# write them to some log; the with block closes (and flushes) the file
with open(filename, 'w') as f:
    pprint(dct, f)
# read the whole file back and evaluate it as a Python expression
with open(filename, 'r') as f:
    obj = eval(f.read())
print(type(obj))
print(obj)
It gets a little trickier if you are trying to write multiple objects to this file, but that is still doable.
The output of the above script is
<type 'dict'>
{'a': 0, 'c': 2, 'b': 1, 'e': 4, 'd': 3, 'g': 6, 'f': 5, 'i': 8, 'h': 7, 'k': 10, 'j': 9, 'm': 12, 'l': 11, 'o': 14, 'n': 13, 'q': 16, 'p': 15, 's': 18, 'r': 17, 'u': 20, 't': 19, 'w': 22, 'v': 21, 'y': 24, 'x': 23, 'z': 25}
Does this solve your problem?
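If the file might contain anything you did not write yourself, ast.literal_eval from the standard library is a more cautious alternative to eval, because it only accepts Python literals (strings, bytes, numbers, tuples, lists, dicts, sets, booleans and None). The sketch below assumes the logged data consists only of such literals; the sample data and the temporary file location are made up for the example.

```python
import ast
import os
import tempfile
from pprint import pprint

data = {"runs": [1, 2, 3], "label": "demo", "done": False}

# Write the pretty-printed structure to a throwaway file.
path = os.path.join(tempfile.mkdtemp(), "my_log_file.txt")
with open(path, "w") as f:
    pprint(data, stream=f)

# Parse the text back into a Python object without executing code.
with open(path) as f:
    restored = ast.literal_eval(f.read())

print(restored == data)
```

Objects whose repr is not a literal (custom classes, for instance) will make literal_eval raise an error instead of being reconstructed.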

Efficient way to write this expression: English alphabets dictionary

What would be an efficient and the right way to implement this expression?
{'a': 1, 'b': 2 ... 'z': 26}
I have tried:
x = dict(zip(chr(range(ASCII of A, ASCII of Z)))
Something like this? But I can't figure out the correct expression.
>>> from string import ascii_lowercase as lowercase
>>> dict((j,i) for i,j in enumerate(lowercase, 1))
{'a': 1, 'c': 3, 'b': 2, 'e': 5, 'd': 4, 'g': 7, 'f': 6, 'i': 9, 'h': 8, 'k': 11, 'j': 10, 'm': 13, 'l': 12, 'o': 15, 'n': 14, 'q': 17, 'p': 16, 's': 19, 'r': 18, 'u': 21, 't': 20, 'w': 23, 'v': 22, 'y': 25, 'x': 24, 'z': 26}
enumerate(lowercase) returns this sequence (0, 'a'), (1, 'b'), (2, 'c'),...
by adding the optional parameter, enumerate starts at 1 instead of 0
enumerate(lowercase, 1) returns this sequence (1, 'a'), (2, 'b'), (3, 'c'),...
The optional parameter is not supported by python older than 2.6, so you could write it this way instead
>>> dict((j,i+1) for i,j in enumerate(lowercase))
dict((chr(x + 96), x) for x in range(1, 27))
You are on the right track, but notice that zip requires a sequence.
So this is what you need:
alphabets = dict(zip([chr(x) for x in range(ord('a'), ord('z')+1)], range(1, 27)))
ord returns the integer ordinal of a one character string. So you can't do a chr(sequence) or an ord(sequence). It has to be a single character, or a single number.
I'm not sure of an exact implementation, but wouldn't it make sense to use the ASCII codes to your advantage as they're in order? Specify the start and end then loop through them adding the ASCII character and the ASCII code minus the starting point.
dictionary comprehension:
{chr(a + 96):a for a in range(1,27)}
>>> {chr(a + 96):a for a in range(1,27)}
{'a': 1, 'c': 3, 'b': 2, 'e': 5, 'd': 4, 'g': 7, 'f': 6, 'i': 9, 'h': 8, 'k': 11, 'j': 10, 'm': 13, 'l': 12, 'o': 15, 'n': 14, 'q': 17, 'p': 16, 's': 19, 'r': 18, 'u': 21, 't': 20, 'w': 23, 'v': 22, 'y': 25, 'x': 24, 'z': 26}
This only works in versions of Python that support dictionary comprehensions, i.e. 2.7 and 3.x.
Guess I didn't read the question closely enough. Fixed
dict( (chr(x), x-ord('a') +1 ) for x in range(ord('a'), ord('z')+1))
Is a dictionary lookup really what you want?
You can just have a function that does this:
def getNum(ch):
    return ord(ch) - ord('a') + 1
This is pretty simple math, so it is possibly more efficient than a dictionary lookup, because the string doesn't need to be hashed and compared.
To do a dictionary lookup, the key you are looking for needs to be hashed, then it needs to find where that hash is in the dictionary. Next, it has to compare the key to the key it found to determine if it is the same or if it is a hash collision. Then, it has to read the value at that location.
The function just needs to do a couple additions. It does have the overhead of a function call though, so that may make it less efficient than a dictionary lookup.
Another thing to consider is what each solution does if the input is invalid (not 'a' - 'z', for example capital 'A'). The dictionary solution would raise a KeyError. The function as written would return a wrong result for 'A' without raising any error to indicate invalid input, although you could add code to it to catch such errors.
The point is that in addition to asking "What would be an efficient way to implement this expression?", you should also ask (at least yourself) "Is this expression really what I want?" and "Is the extra efficiency worth the trade-offs?".
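As a sketch of what such error handling might look like, the function below rejects anything that is not a single lowercase ASCII letter. The name letter_number and the choice of ValueError are assumptions made for this example, not part of the answer above.

```python
def letter_number(ch):
    """Return 1 for 'a' through 26 for 'z'; reject anything else."""
    # The length check guards against strings such as 'ab', which
    # would pass the range comparison but make ord() fail.
    if len(ch) != 1 or not "a" <= ch <= "z":
        raise ValueError("expected one lowercase letter a-z, got %r" % (ch,))
    return ord(ch) - ord("a") + 1

print(letter_number("a"), letter_number("z"))  # 1 26
```

With this check, letter_number('A') raises a ValueError instead of silently returning a negative number.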
