How to understand Python pickle files

I have a problem that I am not able to solve. I'm studying a Convolutional Neural Network for traffic sign recognition, but I don't understand how Python .p (pickle) files are organized, what these files contain, and how to create a .p file holding my own images and the labels associated with them. Can anyone help me? Below are the first lines of code that load the data from the dataset. Thanks a lot.
import pickle
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import random
import cv2
import scipy.misc
from PIL import Image
training_file = 'traffic-signs-data/train.p'
validation_file = 'traffic-signs-data/valid.p'
testing_file = 'traffic-signs-data/test.p'
with open(training_file, mode='rb') as f:
    train = pickle.load(f)
with open(validation_file, mode='rb') as f:
    valid = pickle.load(f)
with open(testing_file, mode='rb') as f:
    test = pickle.load(f)
X_train, y_train = train['features'], train['labels']
X_valid, y_valid = valid['features'], valid['labels']
X_test, y_test = test['features'], test['labels']

pickle isn't just one thing, so there's no single answer to your question, but the whole point of pickle is that it shouldn't matter. In general, any python object can be pickled as-is, and unpickled, without any special knowledge of what's being pickled or how. It's simply a way to freeze an in-memory Python object on disk. It's up to you as the developer to know what data and types went into a pickle file and what you should expect back out, or to handle errors appropriately.
There are rare issues with certain types that don't make sense to be pickled (such as an HTTP connection), and also if you attempt to unpickle an old Python object after changing the underlying Python or library versions (e.g., trying to unpickle a Python 3 object in Python 2), but in general it doesn't matter what you're pickling. If you need greater resilience to change, you should use some serialization system that isn't Python- and library-specific.
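For example, a quick way to find out what a given pickle file contains is simply to load it and inspect the resulting object. With the train.p file from the question, that might look like the sketch below; the keys and shapes in the comments are just what this particular dataset happens to contain, not anything pickle itself guarantees:
import pickle

with open('traffic-signs-data/train.p', mode='rb') as f:
    train = pickle.load(f)

# Inspect what came back out of the pickle
print(type(train))              # e.g. a dict
print(train.keys())             # e.g. 'features', 'labels', ...
print(train['features'].shape)  # e.g. (n_images, 32, 32, 3) for this dataset
print(train['labels'].shape)    # e.g. (n_images,)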

This may be of interest to you: https://docs.python.org/2/library/pickle.html#what-can-be-pickled-and-unpickled
The thing is, if you do not have the class definitions that were used when pickling, you won't be able to unpickle.
So, your data in your .p files may be totally useless.
However, if you own the full flow (i.e., you create the .p files yourself), then know that pickle is just a way to serialize/deserialize data.
So you can have one part of your software that focuses on populating your .p files (try loading your images with Image, using Pillow rather than the old PIL, and then pickle your list of images); a minimal sketch follows below.
You should then be able to unpickle them in the part of your software that you're showing above.
This is just a way to do your preprocessing beforehand and avoid redoing it every time.
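For instance, a minimal sketch of that preprocessing step might look like the following; the directory layout, the 32x32 resize, and the 'features'/'labels' keys are assumptions chosen to match the loading code in the question:
import os
import pickle
import numpy as np
from PIL import Image  # provided by Pillow

features = []
labels = []
# Hypothetical layout: images/<class_id>/<image>.jpg
for class_id in os.listdir('images'):
    class_dir = os.path.join('images', class_id)
    for name in os.listdir(class_dir):
        img = Image.open(os.path.join(class_dir, name)).resize((32, 32))
        features.append(np.array(img))
        labels.append(int(class_id))

train = {'features': np.array(features), 'labels': np.array(labels)}
with open('traffic-signs-data/train.p', 'wb') as f:
    pickle.dump(train, f)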
Another way to do that would be to dump the data as JSON, with your images base64-encoded/decoded. See here for a quick example of the latter part: https://stackoverflow.com/a/31826470/8933502
Good luck!

Related

MXNet parameter serialisation with numpy

I want to use a pre-trained MXNet model on the s390x architecture, but it doesn't seem to work. This is because the pre-trained models are in little-endian whereas s390x is big-endian. So I'm trying to use https://numpy.org/devdocs/reference/generated/numpy.lib.format.html, which works on both little-endian and big-endian systems.
One way to solve this that I've found is to load the model parameters on an x86 machine, call asnumpy, and save them through numpy. Then load the parameters on the s390x machine using numpy and convert them back to MXNet. But I'm not really sure how to code that. Can anyone please help me with it?
UPDATE
It seems the question is unclear, so I'm adding an example that better explains what I want to do, in 3 steps:
Load a preexisting model from MXNet, something like this:
net = mx.gluon.model_zoo.vision.resnet18_v1(pretrained=True, ctx=mx.cpu())
Export the model. The following code saves the model parameters in a .params file. But this binary .params file has endian issues. So, instead of directly saving the model using the MXNet API, I want to save the parameters using numpy (https://numpy.org/devdocs/reference/generated/numpy.lib.format.html), because the numpy binary format (.npy) is endian-independent. I am not sure how I can convert the parameters of an MXNet model into numpy format and save them.
gluon.contrib.utils.export(net, path="./my_model")
Load the model. The following code loads the model from the .params file.
net = gluon.contrib.utils.import(symbol_file="my_model-symbol.json",
                                 param_file="my_model-0000.params",
                                 ctx='cpu')
Instead of loading with the MXNet API, I want to use numpy to load the .npy file that we created in step 2. After we have loaded the .npy file, we need to convert it back to MXNet parameters, so that I can finally use the model in MXNet.
Starting from the code snippets posted in the other question, Save/Load MXNet model parameters using NumPy:
It appears that mxnet has an option to store data internally as numpy arrays:
mx.npx.set_np(True, True)
Unfortunately, this option doesn't do what I hoped (my IPython session crashed).
The parameters are a dict of mxnet.gluon.parameter.Parameter instances, each of them containing attributes of other special datatypes. Disentangling this so that you can store it as a large number of pure numpy arrays (or a collection of them in an .npz file) is a hopeless task.
Fortunately, python has pickle to convert complex data structures into something more or less portable:
# (mxnet/resnet setup skipped)
parameters = resnet.collect_params()
import pickle
with open('foo.pkl', 'wb') as f:
    pickle.dump(parameters, f)
To restore the parameters:
with open('foo.pkl', 'rb') as f:
    parameters_loaded = pickle.load(f)
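As a quick sanity check after unpickling, you can pull an individual parameter out of the dict and convert it to a plain numpy array; the key lookup below is illustrative, so use whatever names parameters_loaded actually contains:
# List a few parameter names, then convert one parameter to numpy
names = list(parameters_loaded.keys())
print(names[:5])
arr = parameters_loaded[names[0]].data().asnumpy()
print(names[0], arr.shape, arr.dtype)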
Essentially, it looks like resnet.save_parameters() as defined in mxnet/gluon/block.py gets the parameters (using _collect_parameters_with_prefix()) and writes them to a file using a custom write function which appears to be compiled from C (I didn't check the details).
You can save the parameters using pickle instead.
For loading, load_parameters (also in util.py) contains this code (with sanity checks removed):
for name in loaded:
    params[name]._load_init(loaded[name], ctx, cast_dtype=cast_dtype, dtype_source=dtype_source)
Here, loaded is a dict as loaded from the file. From examining the code, I don't fully grasp exactly what is being loaded - params seems to be a local variable in the function that is not used anymore. But it's worth a try to start from here, by writing a replacement for the load_parameters function. You can "monkey-patch" a function into an existing class by defining a function outside the class like this:
def my_load_parameters(self, ...):
    ...  # put your modified implementation here

# Overwrite the library method on the class with your version (monkey-patching)
mx.gluon.Block.load_parameters = my_load_parameters
Disclaimers/warnings:
Even if you get save/load via pickle to work on a single big-endian system, it's not guaranteed to work between different-endian systems. The pickle protocol itself is endian-neutral, but if floating-point values (deep inside the mxnet.gluon.parameter.Parameter) were stored as a raw data buffer in machine-endian convention, then pickle is not going to magically guess that groups of 8 bytes in the buffer need to be reversed. I think numpy arrays are endian-safe when pickled.
Pickle is not very robust if the underlying class definitions change between pickling and unpickling.
Never unpickle untrusted data.

use python's sklearn module with custom dataset

I've never used python before and I find myself in the dire need of using sklearn module in my node.js project for machine learning purposes.
I have spent all day trying to understand the code examples in said module, and now that I kind of understand how they work, I don't know how to use my own data set.
Each of the built-in data sets has its own function (load_iris, load_wine, load_breast_cancer, etc.) and they all load data from a .csv and an .rst file. I can't find a function that will allow me to load my own data set. (There's a load_data function, but it seems to be for internal use of the previous three I mentioned, because I can't import it.)
How could I do that? What's the proper way to use sklearn with any other data set? Does it always have to be a .csv file? Could it be programmatically provided data (array, object, etc)?
In case it's important: all those built-in data sets have numeric features, my data set has both numeric and string features to be used in the decision tree.
Thanks
You can load whatever you want and then use sklearn models.
If you have a .csv file, pandas would be the best option.
import pandas as pd
mydataset = pd.read_csv("dataset.csv")
X = mydataset.values[:, 0:10]  # let's assume that the first 10 columns (indices 0-9) are the features/variables
y = mydataset.values[:, 10]    # let's assume that the 11th column (index 10) has the target values/classes
...
sklearn_model.fit(X,y)
Similarly, you can load .txt or .xls files.
The important thing in order to use sklearn models is this:
X should always be a 2D array with shape [n_samples, n_variables]
y should be the target variable (a 1D array of length n_samples).
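Since your data set mixes numeric and string features, note that sklearn estimators expect numeric input; one common approach (a sketch, assuming a pandas DataFrame with a hypothetical 'label' target column) is to one-hot encode the string columns before fitting a decision tree:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("dataset.csv")                 # a programmatically built DataFrame works just as well
y = df["label"]                                 # hypothetical target column name
X = pd.get_dummies(df.drop(columns=["label"]))  # one-hot encode string columns, keep numeric ones as-is

clf = DecisionTreeClassifier()
clf.fit(X, y)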

Loading Large Files as Dictionary

My first question on stackoverflow :)
I am trying to load pretrained vectors from a GloVe file and create a dictionary with words as keys and the corresponding vectors as values. I tried the usual naive method:
fp=open(wordEmbdfile)
self.wordVectors={}
# Create wordVector dictionary
for aline in fp:
    w = aline.rstrip().split()
    self.wordVectors[w[0]] = w[1:]
fp.close()
I see a huge memory pressure on Activity Monitor, and eventually after trying for an hour or two it crashes.
I am going to try splitting in multiple smaller files and create multiple dictionaries.
In the meantime I have following questions:
To read the word2vec file, is it better to read the gzipped file using gzip.open, or to uncompress it and then read it with plain open?
The word vector file has text in the first column and floats in the rest; would it be better to use genfromtxt or loadtxt from numpy?
I intend to save this dictionary using cPickle, and I know loading it is going to be hard too. I have read the suggestion to use shelve; how does it compare to cPickle in loading time and access time? Maybe it's better to spend some more time loading with cPickle if that improves future accesses (assuming cPickle does not crash with 8 GB of RAM). Does anyone have a suggestion on this?
Thanks!
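For reference, a variant of the loop from the question that stores each vector as a compact float32 NumPy array instead of a list of strings uses far less memory; the file name below is an assumption:
import numpy as np

wordVectors = {}
with open('glove.6B.300d.txt') as fp:
    for aline in fp:
        parts = aline.rstrip().split()
        # float32 arrays are much smaller than lists of Python strings
        wordVectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)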

cPickle very large amount of data

I have about 0.8 million images of 256x256 in RGB, which amount to over 7GB.
I want to use them as training data in a Convolutional Neural Network, and want to put them in a cPickle file, along with their labels.
Now, this is taking a lot of memory, to the extent that it needs to swap to my hard drive and almost consumes all of it.
Is this a bad idea?
What would be the smarter/more practical way to load into CNN or pickle them without causing too much memory issue?
This is what the code looks like
import numpy as np
import cPickle
from PIL import Image
import sys,os
pixels = []
labels = []
traindata = []
data=[]
for subdir, dirs, files in os.walk('images'):
    curdir = ''
    for file in files:
        if file.endswith(".jpg"):
            floc = str(subdir) + '/' + str(file)
            im = Image.open(floc)
            pix = np.array(im.getdata())
            pixels.append(pix)
            labels.append(1)
pixels = np.array(pixels)
labels = np.array(labels)
traindata.append(pixels)
traindata.append(labels)
traindata = np.array(traindata)
.....# do the same for validation and test data
.....# put all data and labels into 'data' array
cPickle.dump(data, open('data.pkl', 'wb'))
Is this a bad idea?
Yes, indeed.
You are trying to load 7 GB of compressed image data into memory all at once (around 157 GB of raw pixels for 800k 256x256 RGB images, and several times more once wrapped in default 64-bit NumPy arrays). This will not work. You have to find a way to update your CNN image-by-image, saving the state as you go along.
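For instance, a minimal sketch of batch-wise loading that never holds the whole dataset in memory could look like this; the .jpg layout is reused from the question, while the batch size and the constant label are placeholders:
import os
import numpy as np
from PIL import Image

def iter_image_batches(root_dir, batch_size=128):
    # Yield (images, labels) batches so the whole dataset never sits in memory at once
    batch_pixels, batch_labels = [], []
    for subdir, dirs, files in os.walk(root_dir):
        for name in files:
            if name.endswith(".jpg"):
                im = Image.open(os.path.join(subdir, name))
                batch_pixels.append(np.asarray(im, dtype=np.uint8))
                batch_labels.append(1)  # replace with the real label for this image
                if len(batch_pixels) == batch_size:
                    yield np.array(batch_pixels), np.array(batch_labels)
                    batch_pixels, batch_labels = [], []
    if batch_pixels:
        yield np.array(batch_pixels), np.array(batch_labels)

# Usage: feed each batch to the CNN's training step instead of pickling everything
# for X_batch, y_batch in iter_image_batches('images'):
#     train_step(X_batch, y_batch)  # hypothetical call into whatever CNN framework you use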
Also consider how large your CNN parameter set will be. Pickle is not designed for large amounts of data. If you need to store GBs worth of neural net data, you're much better off using a database. If the neural net parameter set is only a few MB, pickle will be fine, though.
You might also want to take a look at the documentation for pickle.HIGHEST_PROTOCOL, so you are not stuck with an old and unoptimized pickle file format.
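For example, with the cPickle call from the question's code, that just means passing the protocol as the third argument:
with open('data.pkl', 'wb') as f:
    cPickle.dump(data, f, cPickle.HIGHEST_PROTOCOL)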

Data compression in python/numpy

I'm looking at using the Amazon cloud for all my simulation needs. The resulting sim files are quite large, and I would like to move them over to my local drive for ease of analysis, etc. You pay per amount of data you transfer, so I want to compress my sim solutions as much as possible. They are simply numpy arrays saved in the form of .mat files, using:
import scipy.io as sio
sio.savemat(filepath, {'data': sim_array}, do_compression=True)  # sim_array: the numpy array(s) being saved
So my question is: what is the best way to compress numpy arrays (they are currently stored in .mat files, but I could store them using any Python method): Python-side compression, Linux compression tools, or both?
I am in the linux environment, and I am open to any kind of file compression.
Unless you know something special about the arrays (e.g. sparseness, or some pattern) you aren't going to do much better than the default compression, and maybe gzip on top of that. In fact you may not even need to gzip the files if you're using HTTP for downloads and your server is configured to do compression. Good lossless compression algorithms rarely vary by more than 10%.
If savemat works as advertised, you should be able to get gzip compression all in Python with:
import scipy.io as sio
import gzip
f_out = gzip.open(filepath_dot_gz, 'wb')
sio.savemat(f_out, {'data': sim_array}, do_compression=True)  # again, sim_array stands in for your data
f_out.close()
Also, LZMA (a.k.a. xz) gives very good compression on fairly sparse numpy arrays, although it is pretty slow when compressing (and may require more memory as well).
In Ubuntu it is installed with sudo apt-get install python-lzma
It is used as any other file-object wrapper, something like that (to load pickled data):
from lzma import LZMAFile
import cPickle as pickle
if fileName.endswith('.xz'):
    dataFile = LZMAFile(fileName, 'r')
else:
    dataFile = open(fileName, 'rb')  # binary mode, since the file holds pickled data
data = pickle.load(dataFile)
Though it won't necessarily give you the highest compression ratios, I've had good experiences saving compressed numpy arrays to disk with python-blosc. It is very fast and integrates well with numpy.
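A sketch of that approach, assuming the python-blosc package and its pack_array/unpack_array helpers for NumPy arrays:
import blosc
import numpy as np

arr = np.random.rand(1000, 1000)

# Compress the array into a bytes object and write it to disk
with open('arr.blosc', 'wb') as f:
    f.write(blosc.pack_array(arr))

# Read it back and decompress
with open('arr.blosc', 'rb') as f:
    restored = blosc.unpack_array(f.read())

assert np.array_equal(arr, restored)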
