MXNet parameter serialisation with numpy - python

I want to use a pre-trained MXNet model on the s390x architecture, but it doesn't seem to work. This is because the pre-trained models are stored in little-endian format, whereas s390x is big-endian. So I'm trying to use https://numpy.org/devdocs/reference/generated/numpy.lib.format.html, which works on both little-endian and big-endian machines.
One way to solve this that I've found is to load the model parameters on an x86 machine, call asnumpy, and save them through numpy. Then load the parameters on the s390x machine using numpy and convert them back to MXNet. But I'm not really sure how to code it. Can anyone please help me with that?
UPDATE
It seems the question is unclear. So, I'm adding an example that better explains what I want to do in 3 steps -
Load a preexisting model from MXNet, something like this -
net = mx.gluon.model_zoo.vision.resnet18_v1(pretrained=True, ctx=mx.cpu())
Export the model. The following code saves the model parameters in a .param file. But this .param binary file has endian issues. So, instead of saving the model directly with the MXNet API, I want to save the parameters using numpy - https://numpy.org/devdocs/reference/generated/numpy.lib.format.html - because using numpy would make the binary file (.npy) endian-independent. I am not sure how to convert the parameters of an MXNet model into numpy format and save them.
gluon.contrib.utils.export(net, path="./my_model")
Load the model. The following code loads the model from .param file.
net = gluon.contrib.utils.import(symbol_file="my_model-symbol.json",
                                 param_file="my_model-0000.params",
                                 ctx='cpu')
Instead of loading with the MXNet API, I want to use numpy to load the .npy file we created in step 2. After we have loaded the .npy file, we need to convert it back to MXNet, so I can finally use the model in MXNet.
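Conceptually, the numpy-to-MXNet conversion in step 3 is the part I imagine looking roughly like this (the file name is just a made-up example), but I don't know how to wire it up for a whole parameter dict:
import numpy as np
import mxnet as mx

arr = np.load("conv0_weight.npy")  # endian-independent numpy file
nd = mx.nd.array(arr)              # back to an MXNet NDArray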

Starting from the code snippets posted in the other question, Save/Load MXNet model parameters using NumPy:
It appears that mxnet has an option to store data internally as numpy arrays:
mx.npx.set_np(True, True)
Unfortunately, this option doesn't do what I hoped (my IPython session crashed).
The parameters are a dict of mxnet.gluon.parameter.Parameter instances, each of them containing attributes of other special datatypes. Disentangling this so that you can store it as a large number of pure numpy arrays (or a collection of them in an .npz file) is a hopeless task.
Fortunately, python has pickle to convert complex data structures into something more or less portable:
# (mxnet/resnet setup skipped)
parameters = resnet.collect_params()
import pickle
with open('foo.pkl', 'wb') as f:
    pickle.dump(parameters, f)
To restore the parameters:
with open('foo.pkl', 'rb') as f:
    parameters_loaded = pickle.load(f)
Essentially, it looks like resnet.save_parameters() as defined in mxnet/gluon/block.py gets the parameters (using _collect_parameters_with_prefix()) and writes them to a file using a custom write function which appears to be compiled from C (I didn't check the details).
You can save the parameters using pickle instead.
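Alternatively, if you'd rather stay closer to the original plan of an endian-independent numpy file instead of pickle, a minimal sketch could look like this (untested on s390x; the archive name is arbitrary, and it assumes the parameter names returned by collect_params() are identical on both machines):
import numpy as np

# Dump every Parameter as a plain numpy array into one .npz archive;
# numpy tags each array with its byte order, so the file stays portable.
parameters = resnet.collect_params()
arrays = {name: p.data().asnumpy() for name, p in parameters.items()}
np.savez('resnet_params.npz', **arrays)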
For loading, load_parameters (also in util.py) contains this code (with sanity checks removed):
for name in loaded:
    params[name]._load_init(loaded[name], ctx, cast_dtype=cast_dtype, dtype_source=dtype_source)
Here, loaded is a dict as loaded from the file. From examining the code, I don't fully grasp exactly what is being loaded - params seems to be a local variable in the function that is not used anymore. But it's worth a try to start from here, by writing a replacement for the load_parameters function. You can "monkey-patch" a function into an existing class by defining a function outside the class like this:
def my_load_parameters(self, ...):
    ...  # put your modified implementation here
mx.gluon.Block.load_parameters = my_load_parameters
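As a concrete starting point, here is a hedged sketch of such a replacement loader, built on the .npz file from the sketch above. It bypasses _load_init and simply calls set_data, so it assumes the network's parameters have already been initialized (so set_data has shapes to work with), and it skips the sanity checks the original performs:
import numpy as np
import mxnet as mx

def load_parameters_npz(self, filename, ctx=mx.cpu()):
    # Load the byte-order-tagged arrays and push them into the network's
    # Parameter objects; names must match what collect_params() returns.
    loaded = np.load(filename)
    for name, param in self.collect_params().items():
        arr = loaded[name]
        arr = arr.astype(arr.dtype.newbyteorder('='))  # force native byte order
        param.set_data(mx.nd.array(arr, ctx=ctx))

# Attach under a new name so the stock load_parameters stays available.
mx.gluon.Block.load_parameters_npz = load_parameters_npz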
Disclaimers/warnings:
even if you get save/load via pickle to work on a single big-endian system, it's not guaranteed to work between systems of different endianness. The pickle protocol itself is endian-neutral, but if the floating-point values deep inside the mxnet.gluon.parameter.Parameter objects were stored as a raw data buffer in machine-endian convention, then pickle is not going to magically guess that groups of 8 bytes in the buffer need to be reversed. I think numpy arrays are endian-safe when pickled.
Pickle is not very robust if the underlying class definitions change between pickling and unpickling.
Never unpickle untrusted data.

Related

How do I implement a pure Python Pillow image plugin?

I'm trying to write an encoder for a special image format. My intention is to implement it in pure Python as an image plugin for Pillow.
There is an entry in its docs (Writing Your Own Image Plugin) that gives some hints as to how to implement the decoder. With that and some help I got at the project's issues section I managed to get a decoder working.
However, all their encoders are written in C and the ImageFile.PyEncoder superclass (from which all Python encoder classes should inherit) is not even implemented. The documentation is also very sparse in this respect.
Given this state of affairs, is it possible to get such an encoder working? If so, I'd like to know what methods to write, where to get the image data from and where to write the encoded result to.
Related issues:
Writing your own image plugin documentation missing information;
PyEncoder doesn't exist
Edit 1: I'm not looking for a detailed encoder implementation. It's just that the docs don't state what the structure should be in Python. If it serves as proof of work, here is a repo with my work so far.
I managed to get an encoder working by manually implementing the encoder class. The required public methods are:
__init__: sets self.mode and gets the parameters for the encoder, if any;
pushes_fd (decorated with @property): simply returns self._pushes_fd;
setimage: sets self.im (the Pillow image object) and gets the image size;
setfd: sets self.fd (the file object to write to);
encode_to_pyfd: this is the method that you have to write. Read your input data from self.im* and write it to self.fd;
cleanup: perform encoder-specific cleanup (a bare pass works).
The class also needs to set _pushes_fd = True.
Though a bit of a kludge, it was easier than I expected.
*I found it better to work with list(self.im) instead of self.im directly. The former allows slicing and all other list operations.
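For reference, a stripped-down skeleton following the structure above; the format name is made up and the body of encode_to_pyfd is a placeholder, not a real encoder:
from PIL import Image

class MyFormatEncoder:
    _pushes_fd = True

    def __init__(self, mode, *args):
        self.mode = mode
        self.args = args  # encoder parameters, if any
        self.im = None
        self.fd = None

    @property
    def pushes_fd(self):
        return self._pushes_fd

    def setimage(self, im, extents=None):
        self.im = im      # Pillow's internal image object
        self.size = im.size

    def setfd(self, fd):
        self.fd = fd      # file object to write the encoded data to

    def encode_to_pyfd(self):
        # Placeholder: a real encoder transforms list(self.im) into the
        # target format before writing it out.
        self.fd.write(b'MYFORMAT')
        return 0, 0       # (bytes consumed, error code); 0 means no error here

    def cleanup(self):
        pass

Image.register_encoder('MYFORMAT', MyFormatEncoder)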

use python's sklearn module with custom dataset

I've never used Python before, and I find myself in dire need of using the sklearn module in my node.js project for machine learning purposes.
I have spent all day trying to understand the code examples in said module, and now that I kind of understand how they work, I don't know how to use my own data set.
Each of the built-in data sets has its own function (load_iris, load_wine, load_breast_cancer, etc.), and they all load data from a .csv and an .rst file. I can't find a function that will allow me to load my own data set. (There's a load_data function, but it seems to be for internal use of the three I mentioned, because I can't import it.)
How could I do that? What's the proper way to use sklearn with any other data set? Does it always have to be a .csv file? Could it be programmatically provided data (array, object, etc)?
In case it's important: all those built-in data sets have numeric features, my data set has both numeric and string features to be used in the decision tree.
Thanks
You can load whatever you want and then use sklearn models.
If you have a .csv file, pandas would be the best option.
import pandas as pd
mydataset = pd.read_csv("dataset.csv")
X = mydataset.values[:, 0:10]  # let's assume that the first 10 columns are the features/variables
y = mydataset.values[:, 10]    # let's assume that the 11th column has the target values/classes
...
sklearn_model.fit(X,y)
Similarly, you can load .txt or .xls files.
The important thing in order to use sklearn models is this:
X should always be a 2D array with shape [n_samples, n_variables]
y should be the target variable.
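As a worked example that also covers the string features mentioned in the question (the file and column names are made up), one common approach is to one-hot encode the string columns before fitting a decision tree:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv('dataset.csv')                   # any CSV with a 'target' column
X = pd.get_dummies(df.drop(columns=['target']))   # one-hot encode the string columns
y = df['target']

clf = DecisionTreeClassifier()
clf.fit(X, y)
print(clf.predict(X[:5]))                         # predict on the first 5 rows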

How to understand python pickle files

I have a problem I'm not able to solve. I'm studying a convolutional neural network for traffic-sign recognition, but I don't understand how Python .p files are organized, what they contain, or how to create a .p file holding my images and the labels associated with them. Can anyone help me? Below are the first lines of code that load the data in the dataset. Thanks a lot.
import pickle
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import random
import cv2
import scipy.misc
from PIL import Image
training_file = 'traffic-signs-data/train.p'
validation_file = 'traffic-signs-data/valid.p'
testing_file = 'traffic-signs-data/test.p'
with open(training_file, mode='rb') as f:
    train = pickle.load(f)
with open(validation_file, mode='rb') as f:
    valid = pickle.load(f)
with open(testing_file, mode='rb') as f:
    test = pickle.load(f)
X_train, y_train = train['features'], train['labels']
X_valid, y_valid = valid['features'], valid['labels']
X_test, y_test = test['features'], test['labels']
pickle isn't just one thing, so there's no single answer to your question, but the whole point of pickle is that it shouldn't matter. In general, any python object can be pickled as-is, and unpickled, without any special knowledge of what's being pickled or how. It's simply a way to freeze an in-memory Python object on disk. It's up to you as the developer to know what data and types went into a pickle file and what you should expect back out, or to handle errors appropriately.
There are rare issues with certain types that don't make sense to be pickled (such as an HTTP connection), and also if you attempt to unpickle an old Python object after changing the underlying Python or library versions (e.g., trying to unpickle a Python 3 object in Python 2), but in general it doesn't matter what you're pickling. If you need greater resilience to change, you should use some serialization system that isn't Python- and library-specific.
This may be of interest to you: https://docs.python.org/2/library/pickle.html#what-can-be-pickled-and-unpickled
The thing is, if you do not have the classes that were used to pickle, you won't be able to unpickle.
So, your data in your .p files may be totally useless.
However, if you're the owner of the full flow (and you'll have to create the .p files), then know that pickle is just a way to serialize/deserialize data.
So, you can have a piece of your software that focuses on populating your .p files: try loading your images with Image (use Pillow, not the old PIL) and then pickle your list of images.
You should then be able to unpickle them on the part of your software that you're showing above.
This is just a way to do your preprocessing beforehand and avoid redoing it every time.
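For example, a small sketch of that preprocessing step (paths, image size and labels are hypothetical) that builds the same {'features': ..., 'labels': ...} dict the loading code above expects:
import pickle
import numpy as np
from PIL import Image

image_paths = ['signs/stop_001.png', 'signs/yield_001.png']  # hypothetical files
labels = [14, 13]                                            # hypothetical class ids

# Resize to a common shape and stack into one (N, H, W, C) array.
features = np.stack([np.array(Image.open(p).resize((32, 32))) for p in image_paths])
data = {'features': features, 'labels': np.array(labels)}

with open('traffic-signs-data/train.p', 'wb') as f:
    pickle.dump(data, f)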
Another way to do that would be to dump it as JSON, with your images base64 encoded/decoded. See here for a quick example of the latter part: https://stackoverflow.com/a/31826470/8933502
Good luck!

HDF5: alias for data-keys in caffe

I'm trying to load HDF5 data as input data for my caffe-net. So far I can create those HDF5 databases and the list file. I also read the simple example at https://ceciliavision.wordpress.com/2016/03/21/caffe-hdf5-layer/. It states that the dataset keys also serve as the names of the data layers in caffe. What I want is to give them some kind of alias in caffe; is that possible? The reason is that I have two HDF5 DBs with the same dataset structure inside, and therefore the same dataset keys. Is there a name clash if I load both HDF5 DBs into the same net, and if so, can I change the names without changing the HDF5 DBs themselves?
Thanks for help!

Queue input data with tensorflow or skflow

I am training a neural net with a DataFeeder which is a bit slow (because it reads non-contiguous data from an HDF5 file), so the GPU stays idle (GPU-Util is at 0%) half of the time.
Is there a way, in either TensorFlow or skflow, to have multiple DataFeeders running in parallel, to avoid this bottleneck?
TensorFlow has a reader library that can read and queue data in parallel (and in C++). This should remove the bottleneck you are talking about.
We are currently (this/next week) adding support for it to tf.learn (the new name for skflow) to make it easy to use. You will still need to convert your data into one of the formats the readers support (fixed-length vectors, Example protos).
If you want to try to make it work yourself, you can create a separate DataFeeder that uses the ops from the reader library in the input_builder function and returns a no-op in get_feed_dict_fn.
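For reference, a rough sketch of that queue-based reader pattern in the TensorFlow 1.x API (the file name and feature spec are hypothetical):
import tensorflow as tf

# Filenames are fed through a queue; several reader threads can pull from it.
filename_queue = tf.train.string_input_producer(['train-00000.tfrecord'])
reader = tf.TFRecordReader()
_, serialized = reader.read(filename_queue)

features = tf.parse_single_example(
    serialized,
    features={
        'x': tf.FixedLenFeature([128], tf.float32),  # fixed-length feature vector
        'y': tf.FixedLenFeature([], tf.int64),
    })

# num_threads > 1 keeps the queue full while the GPU is busy training.
x_batch, y_batch = tf.train.shuffle_batch(
    [features['x'], features['y']],
    batch_size=64, capacity=10000, min_after_dequeue=1000, num_threads=4)

# In the session you would then call tf.train.start_queue_runners(sess)
# before running the training op that consumes x_batch and y_batch.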
