Reading in a pprinted file in Python

I have a long-running script that collates a bunch of data for me. Without thinking too much about it, I set it up to periodically serialize all the collected data out to a file using something like this:
pprint.pprint(data, open('my_log_file.txt', 'w'))
The output of pprint is perfectly valid Python code. Is there an easy way to read in the file into memory so that if I kill the script I can start where I left off? Basically, is there a function which parses a text file as if it were a Python value and returns the result?

If I understand the problem correctly, you are writing one object to a log file? In that case you can simply use eval to turn it back into a valid Python object.
from pprint import pprint
# make some simple data structures
dct = {k: v for k, v in zip('abcdefghijklmnopqrstuvwxyz', range(26))}
# define a filename
filename = '/tmp/foo.txt'
# write them to some log
pprint(dct, open(filename, 'w'))
# read the whole file back in and evaluate its contents
with open(filename, 'r') as f:
    obj = eval(f.read())
print(type(obj))
print(obj)
It gets a little trickier if you are trying to write multiple objects to this file, but that is still doable.
The output of the above script is
<type 'dict'>
{'a': 0, 'c': 2, 'b': 1, 'e': 4, 'd': 3, 'g': 6, 'f': 5, 'i': 8, 'h': 7, 'k': 10, 'j': 9, 'm': 12, 'l': 11, 'o': 14, 'n': 13, 'q': 16, 'p': 15, 's': 18, 'r': 17, 'u': 20, 't': 19, 'w': 22, 'v': 21, 'y': 24, 'x': 23, 'z': 25}
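A word of caution: eval will execute arbitrary code from the file, so it is only safe for files you fully control. The standard library's ast.literal_eval parses Python literals (dicts, lists, tuples, strings, numbers, booleans, None) without executing code, which covers what pprint emits for plain data structures:
import ast

with open(filename, 'r') as f:
    obj = ast.literal_eval(f.read())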
Does this solve your problem?

Writing a program that counts the occurrences of each letter A-Z without distinguishing between lowercase and uppercase

I am writing a program that imports the Zen of Python text via 'import this'. I want to print the number of occurrences of each letter in the text (the program should not distinguish between lowercase and uppercase)
ex. "Hello world" --> [h= 1, e= 1, l= 3...]
This is what I found when searching for a solution
from collections import Counter
test_str = '''
The Zen of Python, by Tim Peters
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
'''
res = Counter(test_str)
print ("The characters are:\n "
+ str(res))
Unfortunately, this count distinguishes between lowercase and uppercase. Does anyone have a better idea?
The code from above prints this:
The characters are:
Counter({' ': 124, 'e': 90, 't': 76, 'i': 50, 'a': 50, 'o': 43, 's': 43, 'n': 40, 'l': 33, 'r': 32, 'h': 31, '\n': 22, 'b': 20, 'u': 20, 'p': 20, '.': 18, 'y': 17, 'm': 16, 'c': 16, 'd': 16, 'f': 11, 'g': 11, 'x': 6, '-': 6, 'v': 5, ',': 4, "'": 4, 'w': 4, 'T': 3, 'S': 3, 'A': 3, 'I': 3, 'P': 2, 'E': 2, 'k': 2, 'N': 2, '*': 2, 'Z': 1, 'B': 1, 'C': 1, 'F': 1, 'R': 1, 'U': 1, 'D': 1, '!': 1})
Use test_str.lower() before counting: Counter(test_str.lower()).
Also, you can count without importing Counter, using str.count() (or len() on a filtered list) for each letter.
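For the case-insensitive, letters-only count the question asks for, a minimal sketch (restricting to alphabetic characters is an assumption; the original output also counted spaces and punctuation):
from collections import Counter

test_str = "Hello world"
# lowercase first so 'H' and 'h' are counted together, and keep
# only alphabetic characters so spaces and punctuation drop out
res = Counter(c for c in test_str.lower() if c.isalpha())
print(res)  # Counter({'l': 3, 'o': 2, 'h': 1, 'e': 1, 'w': 1, 'r': 1, 'd': 1})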

Splitting Dictionary on Bytes

I have some python code that:
Pulls various metrics from different endpoints
Joins them in a common dictionary with some standardized key/values
Uploads the dictionary to another tool for analysis
While this generally works, there are problems when the dictionary gets too large: it causes performance issues at various points.
I've seen examples using itertools that split on ranges of keys, i.e. evenly by the number of keys. However, I would like to try to split it based on the size in bytes, as some of the metrics are drastically larger than others.
Can a dictionary be dynamically split into a list of dictionaries based on the size in bytes?
Assuming that both keys and values are sane types that you can meaningfully call sys.getsizeof on, and that all objects are distinct, you can use that information to split your dictionary into equal-ish chunks.
First compute the total size if you want the maximum chunk size to be a fraction of it. If your maximum size is fixed externally, you can skip this step:
from sys import getsizeof

total_size = sum(getsizeof(k) + getsizeof(v) for k, v in my_dict.items())
Now you can iterate over the dictionary, assuming an approximately random distribution of sizes throughout, cutting over to a new dict before you exceed the max_size threshold:
from sys import getsizeof

def split_dict(d, max_size):
    result = []
    current_size = 0
    current_dict = {}
    while d:
        k, v = d.popitem()
        increment = getsizeof(k) + getsizeof(v)
        if increment + current_size > max_size:
            # close off the current chunk before it would overflow
            result.append(current_dict)
            if current_size:
                current_dict = {k: v}
                current_size = increment
            else:
                # current_dict was empty, so this single item is itself
                # bigger than max_size: it goes into the (empty) chunk
                # just appended and becomes its own oversized chunk
                current_dict[k] = v
                current_dict = {}
                current_size = 0
        else:
            current_dict[k] = v
            current_size += increment
    if current_dict:
        result.append(current_dict)
    return result
Keep in mind that dict.popitem is destructive: you are actually removing everything from my_dict to populate the smaller versions.
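If you need the original dictionary to survive, pass a throwaway copy in instead, e.g. split_dict(dict(my_dict), max_size).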
Here is a highly simplified example:
>>> from string import ascii_letters
>>> d = {s: i for i, s in enumerate(ascii_letters)}
>>> total_size = sum(getsizeof(k) + getsizeof(v) for k, v in d.items())
>>> split_dict(d, total_size // 5)
[{'Z': 51, 'Y': 50, 'X': 49, 'W': 48, 'V': 47, 'U': 46, 'T': 45, 'S': 44, 'R': 43, 'Q': 42},
{'P': 41, 'O': 40, 'N': 39, 'M': 38, 'L': 37, 'K': 36, 'J': 35, 'I': 34, 'H': 33, 'G': 32},
{'F': 31, 'E': 30, 'D': 29, 'C': 28, 'B': 27, 'A': 26, 'z': 25, 'y': 24, 'x': 23, 'w': 22},
{'v': 21, 'u': 20, 't': 19, 's': 18, 'r': 17, 'q': 16, 'p': 15, 'o': 14, 'n': 13, 'm': 12},
{'l': 11, 'k': 10, 'j': 9, 'i': 8, 'h': 7, 'g': 6, 'f': 5, 'e': 4, 'd': 3, 'c': 2},
{'b': 1, 'a': 0}]
As you can see, the split is not necessarily optimal in terms of distribution, but it ensures that no chunk is bigger than max_size, unless one single entry requires more bytes than that.
Update For Not-Sane Values
If you have arbitrarily large nested values, you can still split at the top level, but you will have to replace getsizeof(v) with something more robust. For example:
from sys import getsizeof
from collections.abc import Mapping, Iterable

def xgetsizeof(x):
    if isinstance(x, Mapping):
        return getsizeof(x) + sum(xgetsizeof(k) + xgetsizeof(v) for k, v in x.items())
    if isinstance(x, Iterable) and not isinstance(x, str):
        return getsizeof(x) + sum(xgetsizeof(e) for e in x)
    return getsizeof(x)
Now you can also compute total_size with a single call:
total_size = xgetsizeof(d)
Notice that this is bigger than the value you saw before, because it also counts the container objects themselves. The earlier result was equivalent to
xgetsizeof(d) - getsizeof(d)
To make the solution really robust, you would need to add instance tracking to avoid circular references and double-counting.
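A minimal sketch of that tracking, keeping a set of ids of objects already visited (the _seen parameter is my addition to xgetsizeof above):
from sys import getsizeof
from collections.abc import Mapping, Iterable

def xgetsizeof(x, _seen=None):
    # count each distinct object once, even with shared or circular references
    if _seen is None:
        _seen = set()
    if id(x) in _seen:
        return 0
    _seen.add(id(x))
    if isinstance(x, Mapping):
        return getsizeof(x) + sum(xgetsizeof(k, _seen) + xgetsizeof(v, _seen)
                                  for k, v in x.items())
    if isinstance(x, Iterable) and not isinstance(x, str):
        return getsizeof(x) + sum(xgetsizeof(e, _seen) for e in x)
    return getsizeof(x)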
I went ahead and wrote such a function for my library haggis, called haggis.objects.getsizeof. It behaves largely like xgetsizeof above, but much more robustly.

Python- parse .txt files with multiple dictionaries

I have the following as a .txt file
{"a": 1, "b": 2, "c": 3}
{"d": 4, "e": 5, "f": 6}
{"g": 7, "h": 8, "i": 9}
How can I use python to open the file, and write a comma to separate each dictionary?
I.e. what regular expression can find every instance of "} {" and put a comma there?
(the real file is much larger (~10GB), and this issue prevents the file from being a syntactically correct JSON object that I can parse with json.loads())
You can use str.join with ',' as the delimiter, reading the file line by line in a generator expression. Then put [] around the contents to make it valid JSON.
import json

with open(filename, 'r') as f:
    contents = '[' + ','.join(line for line in f) + ']'
data = json.loads(contents)
This results in data being:
[
{'a': 1, 'b': 2, 'c': 3},
{'d': 4, 'e': 5, 'f': 6},
{'g': 7, 'h': 8, 'i': 9}
]
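Since the real file is around 10GB, note that the approach above holds the whole document in memory twice (the joined string plus the parsed list). A line-by-line parse avoids that; a minimal sketch, assuming one JSON object per line as in the sample:
import json

def iter_records(filename):
    # yield one parsed dictionary per line instead of loading the whole file
    with open(filename, 'r') as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

for record in iter_records(filename):
    print(record)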

What is the difference between giving a string and a list of string(s) to keras tokenizer?

I am working with keras.preprocessing to tokenize sentences, and I encountered an unexpected case with keras.preprocessing.text.Tokenizer. When I give it a string, word_index comes out as a dictionary of single characters and their indexes, but for a list, word_index is a dictionary of words (split on spaces).
Why does this happen?
String for tokenizer input:
from keras.preprocessing.text import Tokenizer
text = "Keras is a deep learning and neural networks API by François Chollet"
tokenizer = Tokenizer()
tokenizer.fit_on_texts(text) #input of tokenizer as string
print(tokenizer.word_index)
>>> {'e': 1, 'a': 2, 'n': 3, 'r': 4, 's': 5, 'i': 6, 'l': 7, 'o': 8, 'k': 9, 'd': 10, 'p': 11, 't': 12, 'g': 13,
'u': 14, 'w': 15, 'b': 16, 'y': 17, 'f': 18, 'ç': 19, 'c': 20, 'h': 21}
List for tokenizer input:
from keras.preprocessing.text import Tokenizer
text = ["Keras is a deep learning and neural networks API by François Chollet"]
tokenizer = Tokenizer()
tokenizer.fit_on_texts(text) #input of tokenizer as list
print(tokenizer.word_index)
>>> {'keras': 1, 'is': 2, 'a': 3, 'deep': 4, 'learning': 5, 'and': 6, 'neural': 7, 'networks': 8,
'api': 9, 'by': 10, 'françois': 11, 'chollet': 12}
The docs state to use a list of strings or a list of list of strings. There is no mention of whether you are allowed to pass a string as input, so it's possible that what you're doing is undefined behaviour that isn't getting caught.
When you pass a string as input, it looks like Keras interprets it as character-level tokenization. Either way, if you want character-level tokenization, it is much better to pass char_level=True when instantiating the Tokenizer class.
TL;DR: Don't pass a string. The docs don't mention it as a legal argument, and there is a documented way of performing character-level tokenization.
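A minimal sketch of the char_level route, using the same sample sentence as above:
from keras.preprocessing.text import Tokenizer

text = ["Keras is a deep learning and neural networks API by François Chollet"]
tokenizer = Tokenizer(char_level=True)  # explicit character-level tokenization
tokenizer.fit_on_texts(text)  # input is still a list of strings
print(tokenizer.word_index)  # maps individual characters to indices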

Efficient way to write this expression: English alphabets dictionary

What would be an efficient and the right way to implement this expression?
{'a': 1, 'b': 2 ... 'z': 26}
I have tried:
x = dict(zip(chr(range(ASCII of A, ASCII of Z)))
Something like this? But I can't figure out the correct expression.
>>> from string import lowercase
>>> dict((j,i) for i,j in enumerate(lowercase, 1))
{'a': 1, 'c': 3, 'b': 2, 'e': 5, 'd': 4, 'g': 7, 'f': 6, 'i': 9, 'h': 8, 'k': 11, 'j': 10, 'm': 13, 'l': 12, 'o': 15, 'n': 14, 'q': 17, 'p': 16, 's': 19, 'r': 18, 'u': 21, 't': 20, 'w': 23, 'v': 22, 'y': 25, 'x': 24, 'z': 26}
enumerate(lowercase) returns this sequence (0, 'a'), (1, 'b'), (2, 'c'),...
by adding the optional parameter, enumerate starts at 1 instead of 0
enumerate(lowercase, 1) returns this sequence (1, 'a'), (2, 'b'), (3, 'c'),...
The optional start parameter is not supported by Python older than 2.6, so you could write it this way instead
>>> dict((j,i+1) for i,j in enumerate(lowercase))
(Note that string.lowercase is Python 2 only; in Python 3 the equivalent import is string.ascii_lowercase.)
dict((chr(x + 96), x) for x in range(1, 27))
You are on the right track, but notice that zip requires a sequence.
So this is what you need:
alphabets = dict(zip([chr(x) for x in range(ord('a'), ord('z')+1)], range(1, 27)))
ord returns the integer ordinal of a one character string. So you can't do a chr(sequence) or an ord(sequence). It has to be a single character, or a single number.
I'm not sure of an exact implementation, but wouldn't it make sense to use the ASCII codes to your advantage, since they're in order? Specify the start and end, then loop through them, adding each ASCII character and its code minus the starting point.
dictionary comprehension:
{chr(a + 96):a for a in range(1,27)}
>>> {chr(a + 96):a for a in range(1,27)}
{'a': 1, 'c': 3, 'b': 2, 'e': 5, 'd': 4, 'g': 7, 'f': 6, 'i': 9, 'h': 8, 'k': 11, 'j': 10, 'm': 13, 'l': 12, 'o': 15, 'n': 14, 'q': 17, 'p': 16, 's': 19, 'r': 18, 'u': 21, 't': 20, 'w': 23, 'v': 22, 'y': 25, 'x': 24, 'z': 26}
this only works in versions of Python that support dictionary comprehensions, i.e. 2.7 and later (including 3.x)
Guess I didn't read the question closely enough. Fixed:
dict((chr(x), x - ord('a') + 1) for x in range(ord('a'), ord('z') + 1))
Is a dictionary lookup really what you want?
You can just have a function that does this:
def getNum(ch):
    return ord(ch) - ord('a') + 1
This is pretty simple math, so it is possibly more efficient than a dictionary lookup, because the string doesn't need to be hashed and compared.
To do a dictionary lookup, the key you are looking for needs to be hashed, then it needs to find where that hash is in the dictionary. Next, it has to compare the key to the key it found to determine if it is the same or if it is a hash collision. Then, it has to read the value at that location.
The function just needs to do a couple additions. It does have the overhead of a function call though, so that may make it less efficient than a dictionary lookup.
Another thing you may need to consider is what each solution does if the input is invalid (not 'a'-'z', for example a capital 'A'). The dictionary solution would raise a KeyError. If you used a function, you could add code to catch errors. If you were to use 'A' with the bare arithmetic as written, you would get a wrong result, but no error would be raised to indicate the invalid input.
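For example, a sketch of getNum that normalizes case and validates its input (the error message is my choice):
def getNum(ch):
    # accept 'A' as well as 'a' by lowercasing first
    ch = ch.lower()
    if len(ch) != 1 or not ('a' <= ch <= 'z'):
        raise ValueError('expected a single letter a-z, got %r' % (ch,))
    return ord(ch) - ord('a') + 1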
The point is that in addition to asking "What would be an efficient way to implement this expression?", you should also be asking (at least asking yourself) "Is this expression really what I want?" and "Is the extra efficiency worth the trade-offs?"
