I am trying to analyze tensor data, but I could not read the data in the pickled file using np.load(). My Python code is as follows:
import pickle
import numpy as np
import sktensor as skt
import numpy.random as rn
data = np.ones((10, 8, 3), dtype='int32') # 3-mode count tensor of size 10 x 8 x 3
##data = skt.dtensor(data)
with open('data.dat', 'w+') as f: # can be stored as a .dat using pickle
    pickle.dump(data, f)
with open('data.dat', 'r+') as f: # can be loaded back in using pickle.load
    tmp = pickle.load(f)
assert np.allclose(tmp, data)
But when I attempted to use np.load() to load the data in data.dat as follows:
np.load('G:\data.dat')
an error appears:
Traceback (most recent call last):
File "<pyshell#34>", line 1, in <module>
np.load('D:/GDELT_Tensor/data.dat', mmap_mode = 'r')
File "C:\Python27\lib\site-packages\numpy\lib\npyio.py", line 416, in load
"Failed to interpret file %s as a pickle" % repr(file))
IOError: Failed to interpret file 'D:/data.dat' as a pickle.
Can anyone help me?
Don't use the pickle module to save NumPy arrays. Instead, use one of the methods here: http://docs.scipy.org/doc/numpy/reference/routines.io.html
There's even one that uses pickle under the hood. For example (note that np.save appends a .npy extension if the filename doesn't already end in .npy):
np.save('data.npy', data)
tmp = np.load('data.npy')
Another format like CSV or HDF5 might be more suitable for many applications, especially where you want to interoperate with non-Python systems.
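For instance, here is a minimal sketch of the HDF5 route, assuming the third-party h5py package is installed (the file name 'data.h5' and the dataset name 'counts' are just placeholders):
import h5py
import numpy as np

data = np.ones((10, 8, 3), dtype='int32')
# write the tensor to an HDF5 file
with h5py.File('data.h5', 'w') as f:
    f.create_dataset('counts', data=data)
# read it back as a NumPy array
with h5py.File('data.h5', 'r') as f:
    tmp = f['counts'][:]
assert np.allclose(tmp, data)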
I want to process quite big ARFF files in scikit-learn. The files are in a zip archive and I do not want to unpack the archive to a folder before processing. Hence, I use the zipfile module of Python 3.6:
from zipfile import ZipFile
from scipy.io.arff import loadarff
archive = ZipFile( 'archive.zip', 'r' )
datafile = archive.open( 'datafile.arff' )
data = loadarff( datafile )
# …
datafile.close()
archive.close()
However, this yields the following error:
Traceback (most recent call last):
File "./m.py", line 6, in <module>
data = loadarff( datafile )
File "/usr/lib64/python3.6/site-packages/scipy/io/arff/arffread.py", line 541, in loadarff
return _loadarff(ofile)
File "/usr/lib64/python3.6/site-packages/scipy/io/arff/arffread.py", line 550, in _loadarff
rel, attr = read_header(ofile)
File "/usr/lib64/python3.6/site-packages/scipy/io/arff/arffread.py", line 323, in read_header
while r_comment.match(i):
TypeError: cannot use a string pattern on a bytes-like object
According to loadarff documentation, loadarff requires a file-like object.
According to zipfile documentation, open returns a file-like ZipExtFile.
Hence, my question is how to use what ZipFile.open returns as the ARFF input to loadarff.
Note: If I unzip manually and load the ARFF directly with data = loadarff( 'datafile.arff' ), all is fine.
from io import BytesIO, TextIOWrapper
from zipfile import ZipFile
from scipy.io.arff import loadarff
zfile = ZipFile('archive.zip', 'r')
in_mem_fo = TextIOWrapper(BytesIO(zfile.read('datafile.arff')), encoding='utf-8')
data = loadarff(in_mem_fo)
Read zfile into an in-memory BytesIO object, wrap it in a TextIOWrapper with encoding='utf-8', and pass this buffered text object to loadarff.
Edit: Turns out zfile.open() returns a file-like object, so the above can be accomplished with:
zfile = ZipFile('archive.zip', 'r')
in_mem_fo = TextIOWrapper(zfile.open('datafile.arff'), encoding='ascii')
data = loadarff(in_mem_fo)
Thanks @Bernhard
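As a side note, loadarff returns a (data, meta) tuple where data is a NumPy structured array; if you want a tabular view of it, a small sketch assuming pandas is installed:
import pandas as pd

records, meta = data  # data is what loadarff returned above
df = pd.DataFrame(records)
print(df.head())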
I need to append many binary files into one binary file. All my binary files are saved in one folder:
file1.bin
file2.bin
...
For that, I tried this code:
import numpy as np
import glob
import os
Power_Result_File_Path ="/home/Deep_Learning_Based_Attack/Test.bin"
Folder_path =r'/home/Deep_Learning_Based_Attack/Test_Folder/'
os.chdir(Folder_path)
npfiles = glob.glob("*.bin")
loadedFiles = [np.load(bf) for bf in npfiles]
PowerArray=np.concatenate(loadedFiles, axis=0)
np.save(Power_Result_File_Path, PowerArray)
It gives me this error:
"Failed to interpret file %s as a pickle" % repr(file))
OSError: Failed to interpret file 'file.bin' as a pickle
My problem is how to concatenate the binary files; it is not about analysing every file independently.
Taking your question literally: Brute raw data concatenation
files = ['my_file1', 'my_file2']
out_data = b''
for fn in files:
    with open(fn, 'rb') as fp:
        out_data += fp.read()
with open('the_concatenation_of_all', 'wb') as fp:
    fp.write(out_data)
Comment about your example
You seem to be interpreting the files as saved NumPy arrays (i.e. saved via np.save()). The error, however, tells me that those files were not written by numpy: np.load first tries to read the .npy/.npz format and then falls back to pickle, so if you point it at a file that is neither, it throws exactly this "Failed to interpret file as a pickle" error.
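If the .bin files really are raw binary dumps rather than .npy files, a minimal sketch would be to read them with np.fromfile and concatenate; the dtype below is an assumption and must match however the files were originally written:
import glob
import numpy as np

binfiles = sorted(glob.glob("*.bin"))
# dtype=np.uint8 is a placeholder; use the dtype the files were actually written with
arrays = [np.fromfile(bf, dtype=np.uint8) for bf in binfiles]
np.save("Test.npy", np.concatenate(arrays))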
Another option is to concatenate the raw bytes asynchronously with the third-party aiofiles package:
import asyncio
import aiofiles

async def concatenate(files, output_file):
    for file in files:
        async with aiofiles.open(file, mode='rb') as f:
            contents = await f.read()
        if file == files[0]:
            write_mode = 'wb'  # overwrite the output file on the first iteration
        else:
            write_mode = 'ab'  # append to the end of the file afterwards
        async with aiofiles.open(output_file, write_mode) as f:
            await f.write(contents)
I'm having problems reading from a gzipped csv file with the gzip and csv libs. Here's what I got:
import gzip
import csv
import json
f = gzip.open(filename)
csvobj = csv.reader(f, delimiter=',', quotechar="'")
for line in csvobj:
    ts = line[0]
    data_json = json.loads(line[1])
but this throws an exception:
File "C:\Users\yaronol\workspace\raw_data_from_s3\s3_data_parser.py", line 64, in download_from_S3
self.parse_dump_file(filename)
File "C:\Users\yaronol\workspace\raw_data_from_s3\s3_data_parser.py", line 30, in parse_dump_file
for line in csvobj:
_csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)
Gunzipping the file first and opening the result with csv works fine. I've also tried decoding the file contents to convert from bytes to str...
What am I missing here?
The default mode for gzip.open is rb; if you wish to work with strs, you have to specify text mode explicitly:
f = gzip.open(filename, mode="rt")
Off topic: it is good practice to wrap I/O operations in a with block:
with gzip.open(filename, mode="rt") as f:
    csvobj = csv.reader(f, delimiter=',', quotechar="'")
    ...
You are opening the file in binary mode (which is the default for gzip).
Try instead:
import gzip
import csv
f = gzip.open(filename, mode='rt')
csvobj = csv.reader(f, delimiter=',', quotechar="'")
Late to the party, but you can also use the datatable package in Python:
import datatable as dt
df = dt.fread(filename)
df.head()
I have a pkl file from MNIST dataset, which consists of handwritten digit images.
I'd like to take a look at each of those digit images, so I need to unpack the pkl file, but I can't figure out how.
Is there a way to unpack/unzip pkl file?
Generally
Your pkl file is, in fact, a serialized pickle file, which means it has been dumped using Python's pickle module.
To un-pickle the data you can:
import pickle
with open('serialized.pkl', 'rb') as f:
    data = pickle.load(f)
For the MNIST data set
Note that gzip is only needed if the file is compressed:
import gzip
import pickle
with gzip.open('mnist.pkl.gz', 'rb') as f:
    train_set, valid_set, test_set = pickle.load(f)
Where each set can be further split (e.g. for the training set):
train_x, train_y = train_set
Those would be the inputs (digits) and outputs (labels) of your sets.
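As a quick sanity check, you could print the shapes; the exact numbers below assume the common mnist.pkl.gz split of 50,000/10,000/10,000 examples:
print(train_x.shape)  # expected: (50000, 784), flattened 28x28 images
print(train_y.shape)  # expected: (50000,), integer labels 0-9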
If you want to display the digits:
import matplotlib.cm as cm
import matplotlib.pyplot as plt
plt.imshow(train_x[0].reshape((28, 28)), cmap=cm.Greys_r)
plt.show()
The other alternative would be to look at the original data:
http://yann.lecun.com/exdb/mnist/
But that will be harder, as you'll need to write a program to read the binary data in those files. So I recommend you use Python and load the data with pickle. As you've seen, it's very easy. ;-)
Handy one-liner
pkl() (
    python -c 'import pickle,sys;d=pickle.load(open(sys.argv[1],"rb"));print(d)' "$1"
)
pkl my.pkl
Will print __str__ for the pickled object.
The generic problem of visualizing an object is of course undefined, so if __str__ is not enough, you will need a custom script.
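If pretty-printed output is enough, a variant of the same one-liner using pprint (same assumptions as above) would be:
python -c 'import pickle,sys,pprint; pprint.pprint(pickle.load(open(sys.argv[1],"rb")))' my.pkl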
In case you want to work with the original MNIST files, here is how you can deserialize them.
If you haven't downloaded the files yet, do that first by running the following in the terminal:
wget http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
wget http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
wget http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
wget http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Then save the following as deserialize.py and run it.
import numpy as np
import gzip
IMG_DIM = 28
def decode_image_file(fname):
    n_bytes_per_img = IMG_DIM * IMG_DIM
    with gzip.open(fname, 'rb') as f:
        bytes_ = f.read()
    data = bytes_[16:]  # skip the 16-byte IDX header
    if len(data) % n_bytes_per_img != 0:
        raise Exception('Something wrong with the file')
    result = np.frombuffer(data, dtype=np.uint8).reshape(
        len(data) // n_bytes_per_img, n_bytes_per_img)
    return result
def decode_label_file(fname):
    with gzip.open(fname, 'rb') as f:
        bytes_ = f.read()
    data = bytes_[8:]  # skip the 8-byte IDX header
    result = np.frombuffer(data, dtype=np.uint8)
    return result
train_images = decode_image_file('train-images-idx3-ubyte.gz')
train_labels = decode_label_file('train-labels-idx1-ubyte.gz')
test_images = decode_image_file('t10k-images-idx3-ubyte.gz')
test_labels = decode_label_file('t10k-labels-idx1-ubyte.gz')
The script doesn't normalize the pixel values the way the pickled file does. To do that, all you have to do is:
train_images = train_images/255
test_images = test_images/255
The pickle module (and gzip, if the file is compressed) needs to be used.
NOTE: Both are already in the standard Python library, so there is no need to install anything new.
So, I'm trying to save some objects to disk on Windows 7 using Python's pickle. It fails on pretty much any arbitrary object (the contents of saveobj aren't important; it fails regardless). Below is my test code:
import pickle, os, time
outfile = "foo.pickle"
f = open(outfile, 'wb')
p = pickle.Pickler(f, -1)
saveobj = ( 2,3,4,5,["hat", {"mat": 6}])
p.save(saveobj)
#pickle.dump(saveobj, f)
print "done pickling"
f.close()
g = open(outfile, 'rb')
tup = pickle.load(g)
g.close()
print tup
When I run it, I get the following output/error:
done pickling
Traceback (most recent call last):
File "C:\Users\user\pickletest2.py", line 13, in <module>
tup = pickle.load(g)
File "C:\Python26\lib\pickle.py", line 1370, in load
return Unpickler(file).load()
File "C:\Python26\lib\pickle.py", line 858, in load
dispatch[key](self)
File "C:\Python26\lib\pickle.py", line 880, in load_eof
raise EOFError
EOFError
However, if I use pickle.dump() instead of a Pickler object, it works just fine. My reason for using Pickler is that I would like to subclass it so I can perform operations on each object before I pickle it.
Does anybody know why my code is doing this? My searching has revealed that missing 'wb' and 'rb' commonly causes this, as does a missing f.close(), but I have both of those. Is it a problem with using -1 as the protocol? I'd like to keep it, as it can handle objects which define __slots__ without defining a __getstate__ method.
Pickler.save() is a lower-level method that you're not supposed to call directly.
If you call p.dump(saveobj) instead of p.save(saveobj), it works as expected.
Perhaps it should be called _save to avoid confusion. But dump is the method described in the documentation, and it neatly matches up with the module-level pickle.dump.
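Since the goal is to run some code on each object before pickling it, here is a minimal sketch of subclassing Pickler and overriding save while delegating to the parent; this relies on the pure-Python pickle module (as in the question), not cPickle, and the class name and the print are purely illustrative:
import pickle

class LoggingPickler(pickle.Pickler):
    def save(self, obj):
        # hypothetical hook: inspect or transform obj before it is pickled
        print "pickling %r" % (obj,)
        pickle.Pickler.save(self, obj)

f = open('foo.pickle', 'wb')
p = LoggingPickler(f, -1)
p.dump((2, 3, 4, 5, ["hat", {"mat": 6}]))
f.close()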
In general it is better to use cPickle for performance reasons (since cPickle is written in C).
Anyway, using dump it works just fine:
import pickle
import os, time
outfile = "foo.pickle"
f = open(outfile, 'wb')
p = pickle.Pickler(f, -1)
saveobj = ( 2,3,4,5,["hat", {"mat": 6}])
p.dump(saveobj)
#pickle.dump(saveobj, f)
f.close()
print "done pickling"
#f.close()
g = open(outfile, 'rb')
u = pickle.Unpickler(g) #, -1)
tup = u.load()
#tup = pickle.load(g)
g.close()
print tup