How do I secure my pickle files correctly? - python

I'm following this guide to secure my pickle files correctly, but I'm not getting the same output. Granted, I had to make some changes to get it to run the first time:
import hashlib
import hmac
import pickle
import secrets

class Dummy:
    pass

obj = Dummy()
data = pickle.dumps(obj)
digest = hmac.new(b'unique-key-here', data, hashlib.blake2b).hexdigest()

with open('temp.txt', 'wb') as output:
    output.write(str(digest) + ' ' + data)

with open('temp.txt', 'r') as f:
    data = f.read()

digest, data = data.split(' ')

expected_digest = hmac.new(b'unique-key-here', data, hashlib.blake2b).hexdigest()

if not secrets.compare_digest(digest, expected_digest):
    print('Invalid signature')
    exit(1)

obj = pickle.loads(data)
When I run this I get the following stacktrace:
File "test.py", line 21, in <module>
expected_digest = hmac.new(b'unique-key-here', data, hashlib.blake2b).hexdigest()
File "/usr/lib/python3.8/hmac.py", line 153, in new
return HMAC(key, msg, digestmod)
File "/usr/lib/python3.8/hmac.py", line 88, in __init__
self.update(msg)
File "/usr/lib/python3.8/hmac.py", line 96, in update
self.inner.update(msg)
TypeError: Unicode-objects must be encoded before hashing

Your problem is data = f.read(): .read() returns a str, while hmac.new() wants bytes. Either change that line to data = f.read().encode('utf-8'), or open the file in binary mode (the 'b' flag). A binary-mode sketch follows the references below.
References:
7.2. Reading and Writing Files
open()
hmac.new()
.encode()
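
For what it's worth, a minimal sketch of the binary-mode variant, using the same placeholder key and filename as in the question (not the guide's exact code):
import hashlib
import hmac
import pickle
import secrets

KEY = b'unique-key-here'  # placeholder key from the question

class Dummy:
    pass

# Sign the pickled bytes and store "digest<space>payload" in binary mode.
data = pickle.dumps(Dummy())
digest = hmac.new(KEY, data, hashlib.blake2b).hexdigest()
with open('temp.txt', 'wb') as output:
    output.write(digest.encode('ascii') + b' ' + data)

# Binary mode keeps the payload as bytes, so hmac.new() accepts it directly.
with open('temp.txt', 'rb') as f:
    stored = f.read()
digest, data = stored.split(b' ', 1)  # split only on the first space

expected_digest = hmac.new(KEY, data, hashlib.blake2b).hexdigest().encode('ascii')
if not secrets.compare_digest(digest, expected_digest):
    print('Invalid signature')
    exit(1)

obj = pickle.loads(data)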

I ended up having to use the following methods for it to work:
pickle.loads(codecs.decode(pickle_data.encode(), 'base64'))
# and
codecs.encode(pickle.dumps(pickle_obj), "base64").decode()
Not sure why using .encode() and .decode() was still not working for me.
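
In case it helps anyone else, here is a rough sketch of that base64 round-trip, which keeps the file readable in text mode (the object and filename are placeholders; the HMAC signing is omitted):
import codecs
import pickle

pickle_obj = {'example': True}  # placeholder object

# Pickle to bytes, then base64-encode so the result is plain ASCII text.
pickle_text = codecs.encode(pickle.dumps(pickle_obj), "base64").decode()

with open('temp.txt', 'w') as output:
    output.write(pickle_text)

# Read the text back, base64-decode to bytes, then unpickle.
with open('temp.txt') as f:
    restored = pickle.loads(codecs.decode(f.read().encode(), 'base64'))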

Related

serialize a text file into a protobuf message

I have a serialized protobuf message that I can simply read and save in plain text in python with something like this:
import MyMessage
import sys
FilePath = sys.argv[1]
T = MyMessage.MyType()
f = open(FilePath, 'rb')
T.ParseFromString(f.read())
f.close()
print(T)
I can save this to a plain txt file and do what I want to do.
Now I need to do the inverse operation, i.e. read the plain text file, already formatted in the right way, and save it as a serialized protobuf message:
import MyMessage
import sys
FilePath = sys.argv[1]
input = open("./input.txt", 'r')
T = MyMessage.MyType()
T.ParseFromString(input.readlines())
output.write(T.SerializeToString())
input.close()
output.close()
This fails with
Traceback (most recent call last):
File "MyFile.py", line 13, in <module>
T.ParseFromString(input.readlines())
File "C:\Users\xxx\AppData\Local\Programs\Python\Python38\lib\site-packages\google\protobuf\message.py", line 199, in ParseFromString
return self.MergeFromString(serialized)
File "C:\Users\xxx\AppData\Local\Programs\Python\Python38\lib\site-packages\google\protobuf\internal\python_message.py", line 1142, in MergeFromString
serialized = memoryview(serialized)
TypeError: memoryview: a bytes-like object is required, not 'list'
I am not a python nor a protobuf expert, so I guess I am missing something trivial...
Any help?
Thanks :)
print(x) calls str(x), which for protobufs uses the human-readable "text format" representation.
To read back from that format, you can use the google.protobuf.text_format module:
from google.protobuf import text_format

def parse_my_type(file_path):
    with open(file_path, 'r') as f:
        return text_format.Parse(f.read(), MyMessage.MyType())
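
If you then need the binary wire format back on disk, something along these lines should work (the output path is just an illustration):
from google.protobuf import text_format
import MyMessage

def text_to_binary(text_path, binary_path):
    # Parse the human-readable text format back into a message object...
    with open(text_path, 'r') as f:
        message = text_format.Parse(f.read(), MyMessage.MyType())
    # ...then serialize it to the compact binary wire format.
    with open(binary_path, 'wb') as out:
        out.write(message.SerializeToString())

text_to_binary('./input.txt', './output.bin')  # './output.bin' is illustrative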

python tempfile + gzip + json dump

I want to dump a very large dictionary into a compressed JSON file using Python 3 (3.5).
import gzip
import json
import tempfile

data = {"verylargedict": True}

with tempfile.NamedTemporaryFile("w+b", dir="/tmp/", prefix=".json.gz") as fout:
    with gzip.GzipFile(mode="wb", fileobj=fout) as gzout:
        json.dump(data, gzout)
I got this error though.
Traceback (most recent call last):
File "test.py", line 13, in <module>
json.dump(data, gzout)
File "/usr/lib/python3.5/json/__init__.py", line 179, in dump
fp.write(chunk)
File "/usr/lib/python3.5/gzip.py", line 258, in write
data = memoryview(data)
TypeError: memoryview: a bytes-like object is required, not 'str'
Any thoughts?
A GzipFile object has no text mode, so I would create a wrapper to pass as the file-handle object. The wrapper takes the text that json produces and encodes it to bytes before writing it to the gzip file:
class wrapper:
    def __init__(self, gzout):
        self.__handle = gzout

    def write(self, data):
        self.__handle.write(data.encode())
Use it like this:
json.dump(data, wrapper(gzout))
Each time json.dump wants to write to the object, the wrapper.write method is called; it converts the text to bytes and writes them to the binary stream.
(Some built-in wrappers from the io module may fit too, but this implementation is simple and works.)
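
If you prefer the built-in route hinted at above, io.TextIOWrapper can serve as that text-to-bytes shim; here is a sketch against the question's setup (not tested on every 3.5 point release):
import gzip
import io
import json
import tempfile

data = {"verylargedict": True}

with tempfile.NamedTemporaryFile("w+b", dir="/tmp/", prefix=".json.gz") as fout:
    with gzip.GzipFile(mode="wb", fileobj=fout) as gzout:
        # TextIOWrapper presents the binary gzip stream as a text stream,
        # so json.dump can write str chunks to it; closing the wrapper
        # flushes everything down to the gzip layer.
        with io.TextIOWrapper(gzout, encoding="utf-8") as text_out:
            json.dump(data, text_out)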

Read a large text file and write to another file with Python

I am trying to convert a large text file (5 GB+) but got a MemoryError.
From this post, I managed to convert the encoding of a text file into a readable format with this:
import os

path = 'path/to/file'
des_path = 'path/to/store/file'

for filename in os.listdir(path):
    with open('{}/{}'.format(path, filename), 'r+', encoding='iso-8859-11') as f:
        t = open('{}/{}'.format(des_path, filename), 'w')
        string = f.read()
        t.write(string)
        t.close()
The problem is that when I try to convert a text file with a large size (5 GB+), I get this error:
Traceback (most recent call last):
File "Desktop/convertfile.py", line 12, in <module>
string = f.read()
File "/usr/lib/python3.6/encodings/iso8859_11.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
MemoryError
which I understand means it cannot read a file this large into memory at once. I found from several links that I can handle this by reading line by line.
So, how can I adapt my code to read line by line? What I understand about reading line by line is that I need to read a line from f and write it to t until the end of the file, right?
You can iterate over the lines of an open file.
for filename in os.listdir(path):
    inp, out = open_files(filename)
    for line in inp:
        out.write(line)
    inp.close()
    out.close()
Note that I've hidden the complexity of the different paths, encodings and modes in a function that I suggest you actually write yourself (see the sketch below).
As for buffering, i.e. reading/writing larger chunks of text, Python does its own buffering under the hood, so this shouldn't be much slower than a more elaborate solution.
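
A fuller sketch with that helper written out, under the question's assumptions (the paths and the iso-8859-11 source encoding come from the question; open_files() is the hypothetical helper mentioned above):
import os

path = 'path/to/file'
des_path = 'path/to/store/file'

def open_files(filename):
    # Source files are iso-8859-11; the destination uses the default encoding,
    # as in the original code -- adjust both as needed.
    inp = open('{}/{}'.format(path, filename), 'r', encoding='iso-8859-11')
    out = open('{}/{}'.format(des_path, filename), 'w')
    return inp, out

for filename in os.listdir(path):
    inp, out = open_files(filename)
    # Iterating over the file object yields one line at a time, so the
    # whole 5 GB file is never held in memory at once.
    for line in inp:
        out.write(line)
    inp.close()
    out.close()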

"TypeError: object not callable" while using tokenize.detect_encoding()

I'm reading a bunch of txt.gz files, but they have different encodings (at least UTF-8 and cp1252; they are old, dirty files). I try to detect the encoding of fIn before reading it in text mode, but I get the error: TypeError: 'GzipFile' object is not callable
The corresponding code:
# detect encoding
with gzip.open(fIn, 'rb') as file:
    fInEncoding = tokenize.detect_encoding(file)  # this doesn't work
print(fInEncoding)

for line in gzip.open(fIn, 'rt', encoding=fInEncoding[0], errors="surrogateescape"):
    if line.find("From ") == 0:
        if lineNum != 0:
            out.write("\n")
        lineNum += 1
    line = line.replace(" at ", "#")
    out.write(line)
Traceback
$ ./mailmanToMBox.py list-cryptography.metzdowd.com
('Converting ', '2015-May.txt.gz', ' to mbox format')
Traceback (most recent call last):
File "./mailmanToMBox.py", line 65, in <module>
main()
File "./mailmanToMBox.py", line 27, in main
if not makeMBox(inFile,outFile):
File "./mailmanToMBox.py", line 48, in makeMBox
fInEncoding = tokenize.detect_encoding(file.readline()) #this doesn't works
File "/Users/simon/anaconda3/lib/python3.6/tokenize.py", line 423, in detect_encoding
first = read_or_stop()
File "/Users/simon/anaconda3/lib/python3.6/tokenize.py", line 381, in read_or_stop
return readline()
TypeError: 'bytes' object is not callable
EDIT
I tried to use the following code:
# detect encoding
readsource = gzip.open(fIn,'rb').__next__
fInEncoding = tokenize.detect_encoding(readsource)
print(fInEncoding)
I get no error, but it always returns utf-8 even when that isn't the encoding. My text editor (Sublime) correctly detects the cp1252 encoding.
As the documentation of detect_encoding() says, its input parameter has to be a callable that provides lines of input. That's why you get a TypeError: 'GzipFile' object is not callable.
import tokenize

with open(fIn, 'rb') as f:
    codec = tokenize.detect_encoding(f.readline)[0]
... codec will be "utf-8" or something like that.
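
Applied to the gzipped files from the question, a sketch could look like this. Keep in mind that detect_encoding() only looks for a PEP 263 coding cookie or a UTF-8 BOM in the first two lines and otherwise falls back to utf-8, which is why it won't spot cp1252 mail archives:
import gzip
import tokenize

with gzip.open(fIn, 'rb') as f:
    # Pass the bound readline method itself, not the bytes it returns.
    codec = tokenize.detect_encoding(f.readline)[0]
print(codec)  # usually "utf-8" unless the file starts with a cookie or BOM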

Python 3 JSON writing to CSV Error

I am trying to write out a CSV file from data in JSON format. I can get the field names to write to the CSV file, but not the item values I need. This is my first time coding in Python, so any help would be appreciated. The JSON file can be found below for reference:
https://data.ny.gov/api/views/nqur-w4p7/rows.json?accessType=DOWNLOAD
Here is my error:
Traceback (most recent call last):
File "ChangeDataType.py", line 5, in <module>
data = json.dumps(inputFile)
File "/usr/lib64/python3.4/json/__init__.py", line 230, in dumps
return _default_encoder.encode(obj)
File "/usr/lib64/python3.4/json/encoder.py", line 192, in encode
chunks = self.iterencode(o, _one_shot=True)
File "/usr/lib64/python3.4/json/encoder.py", line 250, in iterencode
return _iterencode(o, 0)
File "/usr/lib64/python3.4/json/encoder.py", line 173, in default
raise TypeError(repr(o) + " is not JSON serializable")
TypeError: <_io.TextIOWrapper name='rows.json?accessType=DOWNLOAD' mode='r' encoding='UTF-8'> is not JSON serializable
Here is my code:
import json
import csv

inputFile = open("rows.json?accessType=DOWNLOAD", "r")
data = json.dumps(inputFile)

with open("Data.csv", "w") as csvfile:
    writer = csv.DictWriter(csvfile, extrasaction='ignore',
        fieldnames=["date", "new_york_state_average_gal", "albany_average_gal", "binghamton_average_gal",
                    "buffalo_average_gal", "nassau_average_gal", "new_york_city_average_gal",
                    "rochester_average_gal", "syracuse_average_gal", "utica_average_gal"])
    writer.writeheader()
    for row in data:
        writer.writerow([row["date"], row["new_york_state_average_gal"], row["albany_average_gal"],
                         row["binghamton_average_gal"], row["buffalo_average_gal"], row["nassau_average_gal"],
                         row["new_york_city_average_gal"], row["rochester_average_gal"],
                         row["syracuse_average_gal"], row["utica_average_gal"]])
If you want to read a JSON file you should use json.load instead of json.dumps:
data = json.load(inputFile)
It seems you're still having problems even opening the file.
Python json to CSV
You were told to use json.load: dumps takes an object to a string, whereas you want to read the JSON into a dictionary.
You therefore need to load the JSON file, and you can open both files at once:
with open("Data.csv","w") as csvfile, open("rows.json?accessType=DOWNLOAD") as inputfile:
data = json.load(inputfile)
writer = csv.DictWriter(csvfile,...
Also, for example, considering the JSON data contains "fieldName" : "syracuse_average_gal", and that is the only occurrence of the Syracuse average value, row["syracuse_average_gal"] is not correct.
Carefully inspect your JSON and figure out how to parse it from the very top bracket; a rough sketch follows below.
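
Assuming the file follows the usual data.ny.gov export layout, with the column definitions (including those "fieldName" entries) under meta.view.columns and the row values as positional arrays under data (an assumption worth verifying against your copy), the parsing could look roughly like this:
import csv
import json

wanted = ["date", "new_york_state_average_gal", "albany_average_gal",
          "binghamton_average_gal", "buffalo_average_gal", "nassau_average_gal",
          "new_york_city_average_gal", "rochester_average_gal",
          "syracuse_average_gal", "utica_average_gal"]

with open("rows.json?accessType=DOWNLOAD") as inputfile, \
        open("Data.csv", "w", newline="") as csvfile:
    data = json.load(inputfile)
    # Assumed layout: column metadata under meta.view.columns, rows under data.
    field_names = [col["fieldName"] for col in data["meta"]["view"]["columns"]]
    writer = csv.DictWriter(csvfile, fieldnames=wanted, extrasaction="ignore")
    writer.writeheader()
    for values in data["data"]:
        # Pair each row's positional values with its column name.
        writer.writerow(dict(zip(field_names, values)))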
